Abstract

We optimize multiway equijoins on relational tables using degree information. We give a new 
bound that uses degree information to more tightly bound the maximum output size of a query. 
On real data, our bound on the number of triangles in a social network can be up to 95 times 
tighter than existing worst case bounds. We show that using only a constant amount of degree 
information, we are able to obtain join algorithms with a running time that has a smaller exponent 
than existing algorithms, for any database instance. We also show that this degree information
can be obtained in nearly linear time, which yields asymptotically faster algorithms in the serial 
setting and lower communication algorithms in the MapReduce setting. 

In the serial setting, the data complexity of join processing can be expressed as a function 
$\tilde{O}(\mathrm{IN}^{x} + \mathrm{OUT})$ in terms of input size IN and output size OUT, in which $x$ depends on the query.
An upper bound for $x$ is given by fractional hypertreewidth. We are interested in situations in
which we can get algorithms for which x is strictly smaller than the fractional hypertreewidth. We 
say that a join can be processed in subquadratic time if x < 2. Building on the AYZ algorithm for 
processing cycle joins in quadratic time, for a restricted class of joins which we call 1-series-parallel 
graphs, we obtain a complete decision procedure for identifying subquadratic solvability (subject 
to the 3-SUM problem requiring quadratic time). Our 3-SUM based quadratic lower bound is 
tight, making it the only known tight bound for joins that does not require any assumption 
about the matrix multiplication exponent $\omega$. We also give a MapReduce algorithm that meets
our improved communication bound and handles essentially optimal parallelism. 
1 Introduction


We study query evaluation for natural join queries. Traditional database systems process 
joins in a pairwise fashion (two tables at a time), but recently a new breed of multiway join 
algorithms have been developed that satisfy stronger runtime guarantees. In the sequential 
setting, worst-case-optimal sequential algorithms such as NPRR [16, 17] or LFTJ [18] process
the join in runtime that is upper bounded by the largest possible output size, a stronger
guarantee than what traditional optimizers provide. In MapReduce settings (described in
Appendix A.2), the Shares algorithm [2, 13] (described in Appendix A.3) processes multiway
joins with optimal communication complexity on skew free data. However, traditional 
database systems have developed sophisticated techniques to improve query performance. 
One popular technique used by commercial database systems is to collect “statistics”: 
auxiliary information about data, such as relation sizes, histograms, and counts of distinct
attribute values. Using this information helps the system better estimate the size of
a join’s output and the runtimes of different query plans, and make better choices of plans. 
Motivated by the use of statistics in query processing, we consider how statistics can improve 
the new breed of multiway join algorithms in sequential and parallel settings. 

We consider the first natural choice for such statistics about the data: the degree. The 
degree of a value in a table is the number of rows in which that value occurs in that table. 
We describe a simple preprocessing technique to facilitate the use of degree information, and 
demonstrate its value through three applications: i) An improved output size bound ii) An 
improved sequential join algorithm iii) An improved MapReduce join algorithm. Each of 
these applications has an improved exponent relative to their corresponding state-of-the-art 
versions [5, 8, 16, 18].

Our key technique is what we call degree-uniformization. Assume for the moment that
we know the degree of each value in each relation. We then partition each relation by the degree
of each of its attributes. In particular, we assign each degree to a bucket using a parameter
$L$: we create one bucket for degrees in $[1, L)$, one for degrees in $[L, L^2)$, and so on. We then
place each tuple in every relation into a partition based on the degree buckets for each of
its attribute values. The join problem then naturally splits into smaller join problems, each
consisting of a join using one partition from each relation. Let IN denote
the input size. If we set $L = \mathrm{IN}^{c}$ for some constant $c$, the number of smaller joins
we process will be exponential in the number of relations, but constant with respect to the
data size IN. Intuitively, the benefit of joining partitions separately is that each partition 
will have more information about the input and will have reduced skew. We show that by 
setting L appropriately this scheme allows us to get tighter AGM-like bounds. 

Now we consider a concrete example. Suppose we have a $d$-regular graph with $N$ edges;
the number of triangles in the graph is bounded by $\min(Nd, N^2/d)$ by our degree-based bound
and by $N^{3/2}$ by the AGM bound. In the worst case, $d = \sqrt{N}$ and our bound matches the
AGM bound. But for other degrees, we do much better; better even than simply “summing”
the AGM bounds over each combination of partitions. Table 1 compares our bound (MO)
with the AGM bound for the triangle join on social networks from the SNAP datasets [14].
‘M’ in the table stands for millions. The last column shows the ratio of the AGM bound to
our bound; our bound is tighter by a factor of 11x to 95x. We could not compare the bounds
on the Facebook network, but if the number of friends per user is < 5000, our bound is at
least 450x tighter than the AGM bound.
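To make the comparison concrete, the following minimal sketch evaluates both bounds for a d-regular graph with N edges. The function names are ours and purely illustrative; the formulas are the two bounds stated in the paragraph above, and the sample values of N and d are arbitrary.

```python
def agm_triangle_bound(n_edges: int) -> float:
    """AGM bound N^(3/2) on the number of triangles."""
    return n_edges ** 1.5


def degree_triangle_bound(n_edges: int, d: int) -> float:
    """Degree-based bound min(N*d, N^2/d) for a d-regular graph."""
    return min(n_edges * d, n_edges ** 2 / d)


if __name__ == "__main__":
    N = 1_000_000  # arbitrary illustrative edge count
    for d in (10, 1_000, 100_000):
        ratio = agm_triangle_bound(N) / degree_triangle_bound(N, d)
        print(f"d={d}: AGM / degree-based = {ratio:.1f}")
```

The two bounds coincide only when d is close to the square root of N; for smaller or larger d the degree-based bound is lower.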


We further use degree uniformization as a tool to develop algorithms that satisfy stronger
runtime and communication guarantees. Degree uniformization allows us to get runtimes
with a better exponent than existing algorithms, while requiring only linear time
preprocessing on the data. We demonstrate our idea in both the serial and parallel
(MapReduce) setting, and we now describe each in turn.

Table 1 Triangle bounds on various social networks

Network        MO Bound    AGM Bound    AGM/MO
Twitter        225 M       3764 M       17
Epinions       33 M        362 M        11
LiveJournal    6128 M      573062 M     95


Serial Join Algorithms: We use our degree-uniformization to derive new cases in which 
one can obtain subquadratic algorithms for join processing. More precisely, let IN denote the 
size of the input, and OUT denote the size of the output. Then the runtime of an algorithm 
on a query $Q$ can be written as $\tilde{O}(\mathrm{IN}^{x} + \mathrm{OUT})$ for some $x$. Note that $x \geq 1$ for all algorithms
and queries in this model as we must read the input to answer the query. If the query is
$\alpha$-acyclic, Yannakakis’ algorithm [19] achieves $x = 1$. If the query has fractional hypertree
width (fhw), a recent generalization of tree width [11], equal to 2, then we can achieve $x = 2$
using a combination of algorithms like NPRR and LFTJ with Yannakakis’ algorithm. In this
work, we focus on cases for which x < 2, which we call subquadratic algorithms. Subquadratic 
algorithms are interesting creatures in their own right, but they may provide tools to attack 
the common case in join processing in which OUT is smaller than IN. 

Our work builds on the classical AYZ algorithm [4] , which derives subquadratic algorithms 
for cycles using degree information. This is a better result than the one achieved by the 
fhw result since the fhw value of cycles of length $\geq 4$ is already 2. This result is specific to
cycles, raising the question: “ Which joins are solvable in subquadratic time?” Technically, 
the AYZ algorithm makes use of properties of cycles in their result and of “heavy and light” 
nodes (high degree and low degree, respectively). We show that degree-uniformization is a 
generalization of this method, and that it allows us to derive subquadratic algorithms for a 
larger family of joins. We devise a procedure to upper bound the processing time of a join, 
and an algorithm to match this upper bound. Our procedure improves the runtime exponent 
x relative to existing work, for a large family of joins. Moreover, for a class of graphs that we 
call 1-series-parallel graphs¹, we completely resolve the subquadratic question in the following
sense: For each 1-series-parallel graph, we can either solve it in subquadratic time, or we show
that it cannot be solved subquadratically unless the 3-SUM problem [6] (see Appendix A.6)
can be solved in subquadratic time. Note that 1-series-parallel graphs have fhw equal to 2.
Hence, they can all be solved in quadratic time using existing algorithms, making our 3-SUM
based lower bound tight. There is a known 3-SUM based lower bound of $N^{4/3}$ on triangle join
processing, which only has a matching upper bound under the assumption that the matrix
multiplication exponent $\omega = 2$. In contrast, our quadratic lower bound can be matched by
existing algorithms without any assumptions on $\omega$. To our knowledge, this makes it the only
known tight bound on join processing time for small output sizes.


We also recover our sequential join results within the well-known GHD framework [11].
We do this using a novel notion of width, which we call m-width, that is no larger than fhw,
and sometimes smaller than submodular width [15] (see Appendix E.5). While we resolve the
subquadratic problem on 1-series-parallel graphs, the general subquadratic problem remains
open. We show that known notions of widths, such as submodular width and m-width do
not fully characterize subquadratically solvable joins (see Appendix E.6).


Joins on MapReduce: Degree information can also be used to improve the efficiency of 
joins on MapReduce. Previous work by Beame et al. [8] uses knowledge of heavy hitters 
(values with high degree) to improve parallel join processing on skewed data. It allows a 
limited range of parallelism (number of processors $p < \sqrt{\mathrm{IN}}$), but subject to that achieves
optimal communication for 1-round MapReduce algorithms. We use degree information
to allow all levels of parallelism ($p \geq 1$) while processing the join. We also obtain an
improved degree-based upper bound on output size that can be significantly better than the 
AGM bound even on simple queries. Our improved parallel algorithm takes three rounds of 
MapReduce, matches our improved bound, and out-performs the optimal 1-round algorithm 
in several cases. As an example, our improved bound lets us correctly upper bound the output 
of a sparse triangle join (where each value has degree $O(1)$) by IN instead of $\mathrm{IN}^{3/2}$ as suggested
by the AGM bound. Moreover, we can process the join at maximum levels of parallelism
(with each processor handling only $O(1)$ tuples) at a total communication cost of $O(\mathrm{IN})$,
in contrast to previous work which requires $O(\mathrm{IN}^{3/2})$ communication. Furthermore, previous
work [8] uses edge packings to bound the communication cost of processing a join. Edge
packings have the paradoxical property that adding information on the size of subrelations
by adding the subrelations into the join can make the communication cost larger. As an
example suppose a join has a relation $R$, with an attribute $A$ in its schema. Adding $\pi_A(R)$ to
the set of relations to be joined does not change the join output. However, adding a weight
term for subrelation $\pi_A(R)$ in the edge packing linear program increases its communication
cost bound. In contrast, if we add $\pi_A(R)$ into the join, our degree based bound does not
increase, and will in fact decrease if $|\pi_A(R)|$ is small enough.

¹ A 1-series-parallel graph consists of a source vertex s, a target vertex t, and a set of paths of any length
from s to t, which do not share any nodes other than s and t.

It’s all a matter of degree: Using degree information to optimize multiway joins

Computing Degree Information: In some cases, degree information is not available 
beforehand or is out of date. In such a case, we show a simple way to compute the degrees 
of all values in time linear in the input size. Moreover, the degree computation procedure 
can be fully parallelized in MapReduce. Even after including the complexity of computing 
degrees, our algorithms outperform state of the art join algorithms. 

Our paper is structured as follows: 

• In Section 2 we describe related work.

• In Section 3 we describe a process called degree-uniformization, which mitigates skew.
We show the MO bound on join output size that strengthens the exponent in the AGM
bound, and describe a method to compute the degrees of all attributes in all relations.

• In Section 4 we present DARTS, our sequential algorithm that achieves tighter runtime
exponents than state-of-the-art. We use DARTS to process several joins in subquadratic
time. Then we establish a quadratic runtime lower bound for a certain class of queries
modulo the 3-SUM problem. Finally we recover the results of DARTS within the familiar
GHD framework, using a novel notion of width (m-width) that is tighter than fhw.

• In Section 5 we present another bound with a tighter exponent than AGM (the DBP
bound), and a tunable parallel algorithm whose communication cost at maximum
parallelism equals the input size plus the DBP bound. The algorithm’s guarantees work on all
inputs independent of skew. 


2 Related Work 


We divide related work into four broad categories: 

New join algorithms and implementation: The AGM bound [5] is tight on the output
size of a multiway join in terms of the query structure and sizes of relations in the query.
Several existing join algorithms, such as NPRR [16], LFTJ [18] and Generic Join [17], have
worst case runtime equal to this bound. However, there exist instances of relations where the
output size is significantly smaller than the worst-case output size (given by the AGM bound), 
and the above algorithms can have a higher cost than the output size. We demonstrate 
a bound on output size that has a tighter exponent than the AGM bound by taking into 
account information on degrees of values, and match it with a parallelizable algorithm. 

On $\alpha$-acyclic queries, Yannakakis’ algorithm [19] is instance optimal up to a constant
multiplicative factor. That is, its cost is $O(\mathrm{IN} + \mathrm{OUT})$ where IN is the input size. For
cyclic queries, we can combine Yannakakis’ algorithm with the worst-case optimal algorithms
like NPRR to get a better performance than that of NPRR alone. This is done using
Generalized Hypertree Decompositions (GHDs) [10, 11] of the query to answer the query in
time $\tilde{O}(\mathrm{IN}^{\mathrm{fhw}} + \mathrm{OUT})$ where fhw is a measure of cyclicity of the query. A query is $\alpha$-acyclic
if and only if its fhw is one. Our work allows us to obtain a tighter runtime exponent than
fhw by dealing with values of different degrees separately.

Parallel join algorithms: The Shares [2] algorithm is the optimal one round algorithm for
skew free databases, matching the lower bound of Beame et al. [7]. But its communication
cost can be much worse than optimal when skew is present. Beame et al.’s work [8] deals with skew
and is optimal among 1-round algorithms when skew is present. The GYM [1] algorithm
shows that allowing log(n) rounds of MapReduce instead of just one round can significantly 
reduce cost. Allowing n rounds can reduce it even further. Our work shows that merely going 
from one to three rounds can by itself significantly improve on existing 1-round algorithms. 
Our parallel algorithm can be incorporated into Step 1 of GYM as well, thereby reducing its 
communication cost. 


Using Database Statistics: The cycle detection algorithm by Alon, Yuster and Zwick [4]
can improve on the fhw bound by using degree information in a sequential setting. Specifically,
the fhw of a cycle is two but the AYZ algorithm [4] can process a cycle join in time
$\tilde{O}(\mathrm{IN}^{2-\epsilon} + \mathrm{OUT})$ where $\epsilon > 0$ is a function of the cycle length. We generalize this, obtaining
subquadratic runtime for a larger family of graphs, and develop a general procedure for 
upper bounding the cost of a join by dealing with different degree values separately. 

Beame et al.’s work [8] also uses degree information for parallel join processing. Specifically, 
it assumes that all heavy hitters (values with high degree) and their degrees are known 
beforehand, and processes them separately to get optimal 1-round results. Their work uses 
edge packings to bound the cost of their algorithm. Edge packings have the counterintuitive 
property that adding more constraints, or more information on subrelation sizes, can worsen 
the edge packing cost. This suggests that edge packings alone do not provide the right 
framework for taking degree information into account. Our work remedies this, and the 
performance of our algorithm improves when more constraints are added. In addition, Beame
et al. [8] assume that $M > p^2$ where $M$ is relation size and $p$ is the number of processors.
Thus, their algorithm cannot be maximally parallelized. In contrast, our algorithm can work
at all levels of parallelism, ranging from one in which each processor gets only $O(1)$ tuples to
one in which a single processor does all the processing. 


Degree Uniformization: The partitioning technique of Alon et al. [3] is similar to our
degree-uniformization technique, but has stronger guarantees at a higher cost. It splits a 
relation into ‘parts’ where the maximum degree of any attribute set A in each part P is 
within a constant factor of the average degree of A in P. In contrast, degree-uniformization 
lets us upper bound the maximum degree of A in P in absolute terms, but not relative to 
the average degree of A in P. 


Marx’s work [15] uses a stronger partitioning technique to fully characterize the
fixed-parameter tractability of joins in terms of the submodular width of their hypergraphs. Marx
achieves degree-uniformity within all small projections of the output, while we only achieve
uniform degrees within relations. Marx’s preprocessing is expensive; the technique as written
in Section 4 of his paper [15] takes time $\Omega(\mathrm{IN}^{2c})$ where $c$ is the submodular width of the join
hypergraph. This preprocessing is potentially more expensive than the join processing itself.
Our algorithms run in time $O(\mathrm{IN}^{MW})$ with $MW < c$ for several joins. Marx did not attempt
to minimize this exponent, as his application was concerned with fixed parameter tractability.
We were unable to find an easy way to achieve $O(\mathrm{IN}^{c})$ runtime for Marx’s technique.
3 Degree Uniformization

We describe our algorithms for degree-uniformization and counting, as well as our improved 
output size bound. Section 3.1 introduces our notation. Section 3.2 gives a high-level overview
of our join algorithms. Then, we describe the degree-uniformization which is a key step in our
algorithms. In Section 3.3 we describe the MO bound, an upper bound on join output size
that has a tighter exponent than the AGM bound. We provide realistic examples in which
the MO bound is much tighter than the AGM bound. Finally, in Section 3.4 we describe a
linear time algorithm for computing degrees. 

3.1 Preliminaries and Notation 

Throughout the paper we consider a multiway join. Let $\mathcal{R}$ be the set of relations in the join
and $\mathcal{A}$ be the set of all attributes in those relations’ schemas. For any relation $R$, we let $\mathrm{attr}(R)$
denote the set of attributes in the schema of $R$. We wish to process the join $\bowtie_{R \in \mathcal{R}} R$, defined
as the set of tuples $t$ such that $\forall R \in \mathcal{R} : \pi_{\mathrm{attr}(R)}(t) \in R$. $|R|$ denotes the number of tuples in
relation $R$. For any set of attributes $A \subseteq \mathcal{A}$, a value in attribute set $A$ is defined as a tuple
from $\bigcup_{R \in \mathcal{R} : A \subseteq \mathrm{attr}(R)} \pi_A(R)$. For any $A \subseteq \mathrm{attr}(R)$, the degree of a value $v$ in $A$ in relation $R$
is given by the number of times $v$ occurs in $R$, i.e. $\deg(v, R, A) = |\{t \in R \mid \pi_A(t) = v\}|$. For
all values $v$ of $A$ in $R$, we must have $\deg(v, R, A) \geq 1$.
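The degree definition can be illustrated with a minimal sketch. We assume, for illustration only, that a relation is stored as a collection of distinct Python tuples and that its schema is a tuple of attribute names; the function name `degrees` is ours.

```python
from collections import Counter


def degrees(rel, schema, attrs):
    """Return a mapping v -> deg(v, R, A) for every value v in pi_A(R).

    rel:    iterable of distinct tuples (the relation R)
    schema: tuple of attribute names, aligned with tuple positions
    attrs:  tuple of attribute names forming the attribute set A
    """
    positions = [schema.index(a) for a in attrs]
    # Project every tuple onto A and count how often each projection occurs.
    return Counter(tuple(t[p] for p in positions) for t in rel)


if __name__ == "__main__":
    R = [(1, "a"), (1, "b"), (2, "a")]
    print(dict(degrees(R, ("X", "Y"), ("X",))))
    print(dict(degrees(R, ("X", "Y"), ())))
```

With the empty attribute set the only value is the empty tuple, whose degree is |R|; with the full schema every value has degree 1.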

In Section 4 we denote a join query with a hypergraph $G$; the vertices in the graph
correspond to attributes and the hyperedges to relations. We use $R(X_1, X_2, \ldots, X_k)$ to
denote a relation $R$ having schema $(X_1, X_2, \ldots, X_k)$. IN denotes the input size, i.e. the sum
of sizes of input relations, while OUT denotes the output size. Our output size bounds,
computation costs, and communication costs will be expressed using $\tilde{O}$ notation which hides
polylogarithmic factors, i.e. $\log^{c}(\mathrm{IN})$, for some $c$ not dependent on the number of tuples IN (but
possibly dependent on the number of relations/attributes). All ensuing logarithms in the 
paper, unless otherwise specified, will be to the base IN. 

AGM Bound: Consider the following linear program: 

► Linear Program 1.

Minimize $\sum_{R \in \mathcal{R}} w_R \log(|R|)$ such that $\forall a \in \mathcal{A} : \sum_{R \in \mathcal{R} : a \in \mathrm{attr}(R)} w_R \geq 1$

A valid assignment of weights $w_R$ to relation $R$ in the linear program is called a fractional
cover. If $\rho^*$ is the minimum value of the objective function, then the AGM bound on the
join output size is given by $\mathrm{IN}^{\rho^*}$. In general, for any set of relations $\mathcal{R}$, we use $\mathrm{AGM}(\mathcal{R})$ to
denote the AGM bound on $\bowtie_{R \in \mathcal{R}} R$.
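As a worked instance of Linear Program 1, consider the triangle join with three relations of size N each. The short derivation below is our own illustration of how the covering constraints yield the familiar exponent 3/2; it writes the bound in terms of N rather than IN, which changes it only by a constant factor.

```latex
% Triangle join R(X,Y) \bowtie S(Y,Z) \bowtie T(Z,X), with |R| = |S| = |T| = N.
% One covering constraint per attribute:
\begin{align*}
X:&\; w_R + w_T \geq 1, \qquad Y:\; w_R + w_S \geq 1, \qquad Z:\; w_S + w_T \geq 1 \\
&\Rightarrow\; 2\,(w_R + w_S + w_T) \geq 3 \\
&\Rightarrow\; (w_R + w_S + w_T)\,\log N \;\geq\; \tfrac{3}{2}\,\log N .
\end{align*}
% Equality holds at w_R = w_S = w_T = 1/2, so the AGM bound is N^{3/2}.
```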

3.2 Degree Uniformization 

We describe our high level join procedure in Algorithm 1. In Step 1, we compute the degree of
each value in each attribute set $A$, in each relation $R$. If the degrees are available beforehand,
due to being maintained by the database, then we can skip this step. We further describe
this step in Section 3.4.

Steps 2 and 3 together constitute degree-uniformization. In these steps, we partition each
relation $R$ by degree. In particular, we assign each value in a relation to a bucket based
on its degree: with one bucket for degrees in $[1, L)$, one for degrees in $[L, L^2)$, and so on.
Then we process the join using one partition from each relation, for all possible combinations
of partitions. Each such combination is referred to as a degree configuration. We use $c$ to
denote any individual degree configuration, $C_L$ to denote the set of all degree configurations,
$R(c)$ to denote the part of relation $R$ being joined in configuration $c$, and $\mathcal{R}(c)$ to denote
$\{R(c) \mid R \in \mathcal{R}\}$. Step 2 consists of enumerating all degree configurations, and Step 3 consists
of finding the partition of each relation corresponding to each degree configuration.

Algorithm 1: High level join algorithm
Input: Set of relations $\mathcal{R}$, bucket range parameter $L$
Output: $\bowtie_{R \in \mathcal{R}} R$
1. Compute $\deg(v, R, A)$ for each $R \in \mathcal{R}$, $A \subseteq \mathrm{attr}(R)$, $v \in \pi_A(R)$
2. Compute the set of all $L$-degree configurations $C_L$
foreach $c \in C_L$ do
    3.1. Compute partition $R(c)$ of each relation $R$
    3.2. Compute $\mathcal{R}(c) = \{R(c) \mid R \in \mathcal{R}\}$
    4. Compute join $J_c = \bowtie_{R \in \mathcal{R}(c)} R$
5. return $\bigcup_{c \in C_L} J_c$

In Step 4, we compute $J_c = \bowtie_{R \in \mathcal{R}(c)} R$ for each degree configuration $c$. Section 4 describes
how to perform Step 4 in a sequential setting, while Section 5 describes it for a MapReduce
setting. Step 5 combines the join outputs for each c to get the final output. 

Steps 1, 2, 3 and 5 can be performed efficiently in MapReduce as well as sequential 
settings; thus the cost of Algorithm 1 is determined by Step 4. Step 4 is carried out differently
in sequential and MapReduce settings. Its cost in the sequential setting is lower than the
cost in a MapReduce setting. Steps 1, 2, and 3 have a cost of $O(\mathrm{IN})$, while Step 5 has cost
$O(\mathrm{OUT})$. Since reading the input and output always has a cost of $O(\mathrm{IN} + \mathrm{OUT})$, the only
extra costs we incur are in Step 4 when we actually process the join. Costs for Step 4 will be
described in Sections 4 and 5.

Degree-uniformization: Now we describe degree-uniformization in detail. We pick a value
for a parameter $L$ which we call ‘bucket range’, and define buckets $B_l = [L^l, L^{l+1})$ for all
$l \in \mathbb{N}$. Let $\mathcal{B} = \{B_0, B_1, \ldots\}$. For any two buckets $B_i, B_j \in \mathcal{B}$, we say $B_i \leq B_j$ iff $i \leq j$. A
degree configuration specifies a unique bucket for each relation and set of attributes in that
relation. Formally:

► Definition 1. Given a parameter $L$, we define a degree configuration $c$ to be a function
that maps each pair $(R, A)$ with $R \in \mathcal{R}$, $A \subseteq \mathrm{attr}(R)$ to a unique bucket in $\mathcal{B}$ denoted $c(R, A)$,
such that

$\forall R, A, A' : A' \subseteq A \subseteq \mathrm{attr}(R) \Rightarrow c(R, A) \leq c(R, A')$

$\forall R : c(R, \mathrm{attr}(R)) = B_0$ and $c(R, \emptyset) = B_{\lfloor \log_L(|R|) \rfloor}$

► Example 2. If a join has relations $R_1(X, Y)$, $R_2(Y)$, then a possible configuration is

$(R_1, \emptyset) \mapsto B_3$, $(R_1, \{X\}) \mapsto B_1$, $(R_1, \{Y\}) \mapsto B_2$, $(R_1, \{X, Y\}) \mapsto B_0$, $(R_2, \emptyset) \mapsto B_1$,
$(R_2, \{Y\}) \mapsto B_0$.

► Definition 3. Given a degree configuration $c$ for a given $L$, and a relation $R \in \mathcal{R}$, we
define $R(c)$ to be the set of tuples in $R$ that have degrees consistent with $c$. Specifically:

$R(c) = \{t \in R \mid \forall A \subseteq \mathrm{attr}(R) : \deg(\pi_A(t), R, A) \in c(R, A)\}$

We define $C_L$ to be the set of all degree configurations with parameter $L$.
► Example 4. For a tuple $(a, b) \in R$, where $L^2 \leq |R| < L^3$, with the degree of $a$ in $B_1$, and
that of $b$ in $B_2$, the tuple would be in $R(c)$ if $c(R, \emptyset) = B_2$, $c(R, \{A\}) = B_1$, $c(R, \{B\}) = B_2$,
$c(R, \{A, B\}) = B_0$. On the other hand, it would not be in $R(c)$ if $c(R, \{A\}) = B_0$, even
if we had $c(R, \{A, B\}) = B_0$, $c(R, \{B\}) = B_2$.
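The definitions of buckets, degree configurations and the parts R(c) can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation used in the paper: relations are Python sets of tuples, the configuration of a tuple within one relation is represented as a tuple of bucket indices (one per attribute subset), only non-empty parts are enumerated rather than every configuration, and `natural_join` is a naive nested-loop join used purely to check the result. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def bucket_index(deg, L):
    """Index l of the bucket B_l = [L^l, L^(l+1)) containing deg (deg >= 1)."""
    l = 0
    while deg >= L ** (l + 1):
        l += 1
    return l


def attribute_subsets(schema):
    """All subsets of the schema, from the empty set up to the full schema."""
    return [c for r in range(len(schema) + 1) for c in combinations(schema, r)]


def tuple_configs(rel, schema, L):
    """Map each tuple to its bucket indices, one per attribute subset A."""
    counters = []
    for A in attribute_subsets(schema):
        pos = [schema.index(a) for a in A]
        counters.append((pos, Counter(tuple(t[p] for p in pos) for t in rel)))
    return {
        t: tuple(bucket_index(cnt[tuple(t[p] for p in pos)], L) for pos, cnt in counters)
        for t in rel
    }


def partition_by_config(rel, schema, L):
    """Group the tuples of R by configuration, giving the non-empty parts R(c)."""
    parts = {}
    for t, sig in tuple_configs(rel, schema, L).items():
        parts.setdefault(sig, set()).add(t)
    return parts


def natural_join(relations):
    """Naive natural join over a list of (schema, set_of_tuples) pairs."""
    results = [{}]
    for schema, rel in relations:
        extended = []
        for partial in results:
            for t in rel:
                # Keep t only if it agrees with the attributes bound so far.
                if all(partial.get(a, v) == v for a, v in zip(schema, t)):
                    extended.append({**partial, **dict(zip(schema, t))})
        results = extended
    return {tuple(sorted(r.items())) for r in results}


def uniformized_join(relations, L):
    """Join one part per relation for every combination of parts, then union."""
    parts = [list(partition_by_config(rel, schema, L).values()) for schema, rel in relations]
    schemas = [schema for schema, _ in relations]
    out = set()
    for combo in product(*parts):
        out |= natural_join(list(zip(schemas, combo)))
    return out
```

Because the parts of each relation are disjoint and cover it, the union over all combinations equals the full join; the benefit described in the text comes from processing each combination with an algorithm that exploits its uniform degrees, which this sketch does not attempt.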

A degree configuration also bounds degrees of values in sub-relations, as stated below: 

► Lemma 5. For all $R \in \mathcal{R}$, $A' \subseteq A \subseteq \mathrm{attr}(R)$, $L > 1$, $c \in C_L$, $v \in \pi_{A'}(R)$, $j \geq i \geq 0$:

$c(R, A) = B_i \wedge c(R, A') = B_j \Rightarrow \deg(v, \pi_A(R(c)), A') < L^{j+1-i}$
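The lemma follows from a short counting argument. The sketch below is our own reconstruction from Definitions 1 and 3 and is not quoted from the paper; the symbol $U_v$ is introduced here for illustration only.

```latex
% Let U_v be the set of A-values of R(c) that extend v:
%   U_v = \{ u \in \pi_A(R(c)) : \pi_{A'}(u) = v \},  so  \deg(v, \pi_A(R(c)), A') = |U_v|.
% If U_v is empty the claim is trivial. Otherwise v is the A'-projection of some tuple
% of R(c), so by Definition 3 its degree in R lies in B_j, and each u \in U_v has degree in B_i:
\begin{align*}
\deg(v, R, A') &< L^{j+1}, &
\deg(u, R, A) &\geq L^{i} \quad \text{for all } u \in U_v .
\end{align*}
% Tuples of R counted for distinct u are disjoint, and all of them project to v on A':
\begin{align*}
|U_v| \cdot L^{i} \;\leq\; \sum_{u \in U_v} \deg(u, R, A) \;\leq\; \deg(v, R, A') \;<\; L^{j+1}
\quad\Longrightarrow\quad |U_v| < L^{\,j+1-i} .
\end{align*}
```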


Choosing L: The optimal value of parameter $L$ depends on our application. $L$ has three
effects: (i) For the DBP/MO bounds (Sections 3.3, 5) and sequential algorithm (Section 4),
the error in output size estimates is exponential in $L$ (with the exponent depending only on
the number of attributes) (ii) The load per processor for the parallel algorithm (Section 5)
is $O(L)$ (iii) the number of rounds for the parallel algorithm is $\log_L(\mathrm{IN})$. As a result, we
choose a small $L$ ($= 2$) for the sequential algorithm and DBP/MO bounds, and a larger $L$
($=$ load capacity $= \mathrm{IN}^{\gamma}$ for some $\gamma < 1$) for the parallel algorithm.

3.3 Beyond AGM : The MO Bound 

We now use degree-uniformization to tighten our upper bound on join output size. 

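► Definition 6. Let $\mathcal{R}$ be a set of relations, with attributes in $\mathcal{A}$. For each $R \in \mathcal{R}$, $A \subseteq \mathrm{attr}(R)$,
let $d_{R,A} = \max_{v \in \pi_A(R)} \deg(v, R, A)$. If $A = \emptyset$ then $d_{R,A} = |R|$. And for any $A \subseteq B \subseteq \mathrm{attr}(R)$,
let $d(A, B, R)$ denote $\log(d_{\pi_B(R), A})$. Then consider the following linear program.

► Linear Program 2.

Maximize $s_{\mathcal{A}}$ s.t. (i) $s_{\emptyset} = 0$ (ii) $\forall A, B$ s.t. $A \subseteq B$ : $s_A \leq s_B$
(iii) $\forall A, B, E, R$ s.t. $R \in \mathcal{R}$, $E \subseteq \mathcal{A}$, $A \subseteq B \subseteq \mathrm{attr}(R)$ : $s_{B \cup E} \leq s_{A \cup E} + d(A, B, R)$

We define $m_{\mathcal{A}}$ to be the maximum objective value of the above program.

► Proposition 7. The output size $|\bowtie_{R \in \mathcal{R}} R|$ is in $O(\mathrm{IN}^{m_{\mathcal{A}}})$.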

This is proved in Appendix E. Intuitively, for any $A \subseteq \mathcal{A}$, $s_A$ stands for possible values of
$\log(|\pi_A(\bowtie_{R \in \mathcal{R}} R)|)$. This explains the first two constraints (projecting onto the empty set
gives size 1, and the projection size over $A$ is monotone in $A$). For the third constraint, we
use the fact that each value in $A$ has at most $d_{\pi_B(R), A}$ values in $B$, thus each tuple in
$\pi_{A \cup E}(\bowtie_{R \in \mathcal{R}} R)$ can give us at most $\mathrm{IN}^{d(A,B,R)}$ tuples in $\pi_{B \cup E}(\bowtie_{R \in \mathcal{R}} R)$. The linear program
attempts to maximize the total output size ($\mathrm{IN}^{s_{\mathcal{A}}}$) while still satisfying the constraints.
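Every constraint of Linear Program 2 bounds the difference of two variables, so, by the standard correspondence between difference constraints and shortest paths, its optimum equals the shortest-path distance from the empty attribute set to the full attribute set in a graph over attribute subsets. The sketch below uses that observation to evaluate the bound on small instances. This reformulation is our own illustration and is not a procedure given in the text; the function names are hypothetical, relations are assumed non-empty, natural logarithms are used, and the result is exponentiated, which is equivalent to working with logarithms to the base IN.

```python
import heapq
from itertools import combinations
from math import exp, log


def subsets(attrs):
    """All subsets of attrs as frozensets."""
    items = sorted(attrs)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def max_degree(rel, schema, A, B):
    """d_{pi_B(R), A}: largest number of distinct B-values sharing one A-value."""
    a_pos = [schema.index(x) for x in sorted(A)]
    b_pos = [schema.index(x) for x in sorted(B)]
    pairs = {(tuple(t[p] for p in a_pos), tuple(t[p] for p in b_pos)) for t in rel}
    counts = {}
    for a_val, _ in pairs:
        counts[a_val] = counts.get(a_val, 0) + 1
    return max(counts.values())


def output_size_bound(relations):
    """exp(m) for the optimum m of Linear Program 2; relations is a list of
    (schema, set_of_tuples) pairs with non-empty tuple sets."""
    all_attrs = frozenset(a for schema, _ in relations for a in schema)
    nodes = subsets(all_attrs)
    edges = {n: [] for n in nodes}
    for big in nodes:  # constraint (ii): s_A <= s_B, edge B -> A of weight 0
        for small in nodes:
            if small < big:
                edges[big].append((small, 0.0))
    for schema, rel in relations:  # constraint (iii): edge (A u E) -> (B u E)
        for B in subsets(schema):
            for A in subsets(B):
                if A == B:
                    continue
                w = log(max_degree(rel, schema, A, B))
                for E in nodes:
                    edges[A | E].append((B | E, w))
    # Dijkstra from the empty set; constraint (i) fixes its value to 0.
    start = frozenset()
    dist = {n: float("inf") for n in nodes}
    dist[start] = 0.0
    heap, tie = [(0.0, 0, start)], 1
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in edges[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], tie, v))
                tie += 1
    return exp(dist[all_attrs])
```

The number of graph nodes is exponential in the number of attributes but independent of the data size, so this is only practical for small queries.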

We now define the MO bound. 


► Definition 8. Let $\mathrm{MO}(\mathcal{R})$ denote the value $m_{\mathcal{A}}$ for any join query consisting of relations
$\mathcal{R}$. Then the MO bound is given by $\sum_{c \in C_2} \mathrm{IN}^{\mathrm{MO}(\mathcal{R}(c))}$.

► Theorem 9. The MO bound is in $O(\mathrm{AGM}(\mathcal{R}))$.

The constant in the $O()$ notation depends on the number of attributes in the query, but not
on the number of tuples. This result is proved in two steps. Theorem 26 states that the DBP
bound (introduced in Section 5) is smaller than the AGM bound, while Theorem 23 implies
that the MO bound is smaller than the DBP bound times a constant.



M. R. Joglekar and C. M. Re 




► Example 10. Let $L = 2$ for this example. Consider a triangle join $R(X, Y) \bowtie S(Y, Z) \bowtie T(Z, X)$.
Let $|R| = |S| = |T| = N$. The AGM bound on this is $N^{3/2}$. Let the degree of each
value $x$ in $X$ in both $R$ and $T$ be $h$. For different values of $h$ we will find an upper bound on
$m_{\{X,Y,Z\}}$ and hence on the output size.

Case 1. $h < \sqrt{N}$: Then $s_{\{X\}} \leq s_{\emptyset} + d(\emptyset, \{X\}, R) = \log(N/h)$. Thus,
$s_{\{X,Y\}} \leq s_{\{X\}} + d(\{X\}, \{X, Y\}, R) \leq \log(N/h) + \log(h) = \log(N)$. Finally,
$s_{\{X,Y,Z\}} \leq s_{\{X,Y\}} + d(\{X\}, \{X, Z\}, T) \leq \log(N) + \log(h)$. Thus the MO bound is $\leq Nh < N^{3/2}$.

Case 2. $h > \sqrt{N}$: Since there can be at most $N/h$ distinct $X$ values, we have
$d(\{Y\}, \{X, Y\}, R) \leq \log(N/h)$. Moreover, if the degree of $Y$ in $S$ in a degree configuration
is $g$, then $s_{\{Y,Z\}} \leq s_{\{Y\}} + d(\{Y\}, \{Y, Z\}, S) \leq \log(N/g) + \log(g) = \log(N)$. Finally,
$s_{\{X,Y,Z\}} \leq s_{\{Y,Z\}} + d(\{Y\}, \{X, Y\}, R) \leq \log(N) + \log(N/h) = \log(N^2/h)$.
Thus the MO bound is $\leq N^2/h < N^{3/2}$.

The MO bound has a strictly smaller exponent than AGM unless $h \approx \sqrt{N}$. Computing
the AGM bound individually over each degree configuration does not help us do better, as
the above example can have all tuples in a single degree configuration.

► Example 11. Consider a matching database [7], where each attribute has the same domain
of size $N$, and each relation is a matching. Thus each value has degree 1, and $d(A, B, R)$
equals 0 when $A \neq \emptyset$ and 1 if $A = \emptyset$. The MO bound on such a database trivially equals $N$,
which can have an unboundedly smaller exponent than the AGM bound.


Appendix F.3 similarly compares the DBP and AGM bounds, showing that DBP (and 
hence MO) has a strictly smaller exponent than AGM for ‘almost all’ degrees. 


3.4 Degree Computation 

If we do not know degrees in advance we can compute them on the fly, as stated below: 

► Lemma 12. Given a relation $R$, $A \subseteq \mathrm{attr}(R)$, and $L > 1$, we can find $\deg(v, R, A)$ for
each $v \in \pi_A(R)$ in a MapReduce setting, with $O(|R|)$ total communication, in $O(\log_L(|R|))$
MapReduce rounds, and at $O(L)$ load per processor. In a sequential setting, we can compute
degrees in time $O(|R|)$.

The proof of this lemma is relatively straightforward and can be found in Appendix B.

To perform degree-uniformization, we compute degrees for all relations $R$, and all
$A \subseteq \mathrm{attr}(R)$. The number of such $(R, A)$ pairs is exponential in the number and size of relations,
but is still constant with respect to the input size IN. 


4 Sequential Join Processing 


We present our results on sequential join processing. Section 4.1 describes our problem
setting. In Section 4.2 we present our sequential join algorithm, DARTS (for Degree-based
Attribute-Relation Transforms). DARTS handles queries consisting of a join followed by 
a projection. A join alone is simply a join followed by projection onto all attributes. We 
pre-process the input by performing degree-uniformization, and then run DARTS on each 
degree configuration. DARTS works by performing a sequence of transforms on the join 
problem; each transform reduces the problem to smaller problems with fewer attributes or 
relations. We describe each of the transforms in turn. We then show that DARTS can be 
used to recover (while potentially improving on) known join results such as those of the 
NPRR algorithm, Yannakakis’ algorithm, the fhw algorithm, and the AYZ algorithm. 

In Section 4.3 we apply DARTS to the subquadratic joins problem, presenting cases in
which we can go beyond existing results in terms of the runtime exponent. For a family of
joins called 1-series-parallel graphs, we obtain a full dichotomy for the subquadratic joins
problem. That is, for each 1-series-parallel graph, we can either show that DARTS processes
its join in subquadratic time, or that no algorithm can process it in subquadratic time modulo
the 3-SUM problem. Note that 1-series-parallel graphs have treewidth 2, making them easily
solvable in quadratic time. Thus, our 3-SUM based quadratic lower bound on some of the
graphs is tight, making it, to our knowledge, the only tight bound for join processing time
with small output sizes. In contrast, there is an $N^{4/3}$ lower bound (using 3-SUM) for triangle
joins, but its matching upper bound depends on the additional assumption that the matrix
multiplication exponent equals two.

In Section 4.4 we show that most results of the DARTS algorithm can be recovered
using the well known framework of Generalized Hypertree Decompositions (GHDs), along
with a novel notion of width we call m-width. We show that m-width is no larger than fhw,
and sometimes smaller than submodular width.


4.1 Setting 

In this section, we focus on a sequential join processing setting. We are especially interested 
in the subquadratic joins problem stated below: 

► Problem 1. For any graph $G$, we let each node in the graph represent an attribute and
each edge represent a relation of size $N$. Then we want to know, for what graphs $G$ can we
process a join over the relations in subquadratic time, i.e. $O(N^{2-\epsilon} + \mathrm{OUT})$ for some $\epsilon > 0$?


Performing a join in subquadratic time is especially important when we have large datasets
being joined, and the output size is significantly smaller than the worst case output size.
Note that we define subquadratic to be a $\mathrm{poly}(N)$ factor smaller than $N^2$, so for instance an
$N^2 / \log N$ algorithm is not subquadratic by our definition.

As an example, if a join query is α-acyclic, then Yannakakis’ algorithm can answer it
in time $O(N + \mathrm{OUT})$, which is subquadratic. More generally, if the fractional hypertree
width (fhw) of a query is $\rho^*$, the join can be processed in time $O(N^{\rho^*} + \mathrm{OUT})$ using a
combination of the NPRR and Yannakakis’ algorithms. The fhw of an α-acyclic query is
one. For any graph with fhw < 2, we can process its join in subquadratic time. The AYZ
algorithm (described in Appendix A.5) allows us to process joins over length $n$ cycles in
time $O(N^{2 - \frac{1}{1 + \lceil n/2 \rceil}} + \mathrm{OUT})$, even though cycles of length ≥ 4 have fhw = 2. To the best
of our knowledge, this is the only previous result that can process a join with fhw ≥ 2 in
subquadratic time.

The DARTS algorithm is applicable to any join-project problem and not just those with
equal relation sizes like in Problem 1. Applying DARTS to Problem 1 lets us process several
joins in subquadratic time despite having fhw ≥ 2. Section 4.4 recovers the subquadratic
runtimes of DARTS using GHDs that have m-width < 2.


4.2 The DARTS algorithm 

We now describe the DARTS algorithm. The problem that DARTS solves is more general
than a join. It takes as input a set of relations $\mathcal{R}$, and a set of attributes $O$ (which stands
for Output), and computes $\pi_O(\Join_{R \in \mathcal{R}} R)$. When $O = \mathcal{A}$, the problem reduces to just a
join. We first pre-process the inputs by performing degree-uniformization. Then each degree
configuration is processed separately by DARTS. The $L$ parameter for degree-uniformization
is set to be very small ($O(1)$). The total computation time is the sum of the computation
times over all degree configurations. Let $G = (c, \mathcal{R}(c), O)$. That is, $G$ specifies the query
relations, output attributes, and degrees for each attribute set in each relation according
to the degree configuration. We let $c_G, \mathcal{R}_G, O_G$ denote the degree configuration of $G$, the
relations in $G$, and the output attributes of $G$. We define two notions of runtime complexity
for the join-project problem on $G$:

► Definition 13. $Q(G)$ is the smallest value such that a join-projection with query structure,
degrees, and output attributes given by those in $G$ can be processed in time $O(Q(G) + \mathrm{OUT})$.
$P(G)$ is the smallest value such that a join-projection with query structure, degrees, and
output attributes given by those in $G$ can be processed in time $O(P(G))$.

► Example 14. As an example of the difference between $P$ and $Q$, consider a chain join
$G$ with relations $R_1(X_1, X_2)$, $R_2(X_2, X_3)$, $R_3(X_3, X_4)$, and $O = \{X_1, X_2, X_3, X_4\}$. All
relations have size $N$, and the degree of each attribute in each relation is $\sqrt{N}$. Then $P(G)$
would be $N^2$, the worst case size of the output (where all attributes have $\sqrt{N}$ values and
each relation is a full cartesian product). $Q(G)$ on the other hand would be $N$ because the
join is α-acyclic, and Yannakakis’ algorithm lets us process the join in time $O(N + \mathrm{OUT})$.
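The worst case in Example 14 can be checked numerically on a small instance. The sketch below is only an illustration under assumed names (`chain_example`, `chain_join`) and an assumed domain size of 4 values per attribute; it builds the three full Cartesian-product relations and counts the output of the chain join.

```python
from itertools import product

def chain_example(n_root):
    """Build R1(X1,X2), R2(X2,X3), R3(X3,X4) as full Cartesian products over a
    domain of n_root values, so each relation has N = n_root**2 tuples and
    every value has degree n_root = sqrt(N)."""
    dom = range(n_root)
    rel = set(product(dom, dom))
    return rel, rel, rel

def chain_join(r1, r2, r3):
    """Naive hash join of the three-relation chain; returns the output tuples."""
    by_x2 = {}
    for x2, x3 in r2:
        by_x2.setdefault(x2, []).append(x3)
    by_x3 = {}
    for x3, x4 in r3:
        by_x3.setdefault(x3, []).append(x4)
    out = set()
    for x1, x2 in r1:
        for x3 in by_x2.get(x2, []):
            for x4 in by_x3.get(x3, []):
                out.add((x1, x2, x3, x4))
    return out

n_root = 4                      # sqrt(N), an arbitrary small choice
r1, r2, r3 = chain_example(n_root)
N = len(r1)
out = chain_join(r1, r2, r3)
```

With these inputs the output has exactly $N^2$ tuples, which is why any bound of the $P(G)$ kind must be quadratic here, while the $Q(G)$ kind charges the output separately.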

4.2.1 Heavy, Light and Split 

The DARTS algorithm performs a series of transforms on $G$, each of which reduces it to a
smaller problem. In each step, it chooses one of three types of transforms, which we call
Heavy, Light and Split. Each transform takes as input $G$ itself and either an attribute or a
set of attributes in the relations of $G$. Then it reduces the join-project problem on $G$ to a
simpler problem via a procedure. This reduction gives us a bound on $P(G)$ and/or $Q(G)$ in
terms of the $P$ and $Q$ values of simpler problems. We describe each of these transforms in
turn, along with their input, procedure, and bound.

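Heavy:

Input: $G$, an attribute $X$

Procedure: Let $\mathcal{R}_X = \{R \in \mathcal{R}(c) \mid X \in \mathrm{attr}(R)\}$. Then we compute the values $x$ of $X$
that lie in all relations in $\mathcal{R}_X$, i.e. $\mathrm{vals}(X) = \bigcap_{R \in \mathcal{R}_X} \pi_X R$. Then for each $x \in \mathrm{vals}(X)$, we
marginalize on $x$. That is, we solve the reduced problem:

$J_x = \pi_{O \setminus \{X\}} \left( \Join_{R \in \mathcal{R}(c) \setminus \mathcal{R}_X} R \;\Join\; \Join_{R \in \mathcal{R}_X} (\pi_{\mathcal{A} \setminus \{X\}} \sigma_{X = x} R) \right)$

Our final output is $\bigcup_{x \in \mathrm{vals}(X)} (\pi_O x) \times J_x$. For each relation $R \in \mathcal{R}_X$, let $d_R$ be the maximum
value in bucket $c(R, \{X\})$. So $|\mathrm{vals}(X)| \le \min_{R \in \mathcal{R}_X} \frac{|R|}{d_R}$. Secondly, in each reduced problem
$J_x$, the size of each reduced relation $\pi_{\mathcal{A} \setminus \{X\}} \sigma_{X = x} R$ for $R \in \mathcal{R}_X$ reduces to at most $d_R$. Let
$G'$ denote the reduced relations, degrees, and output attributes for $J_x$. This gives us:
Bound: $Q(G) \le \left(\min_{R \in \mathcal{R}_X} \frac{|R|}{d_R}\right) Q(G')$, $P(G) \le \left(\min_{R \in \mathcal{R}_X} \frac{|R|}{d_R}\right) P(G')$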
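The Heavy procedure can be sketched on small in-memory relations. The code below is an assumed illustration, not the DARTS implementation: relations are encoded as `(attrs, rows)` pairs, the helpers `natural_join`, `project` and `join_all` are hypothetical, and the reduced problems are solved by a naive join rather than by recursive transforms.

```python
from functools import reduce

def natural_join(a, b):
    """Natural join of two relations given as (attrs, rows)."""
    a_attrs, a_rows = a
    b_attrs, b_rows = b
    shared = [x for x in a_attrs if x in b_attrs]
    extra = [x for x in b_attrs if x not in a_attrs]
    index = {}
    for row in b_rows:
        rec = dict(zip(b_attrs, row))
        key = tuple(rec[s] for s in shared)
        index.setdefault(key, []).append(tuple(rec[e] for e in extra))
    out = set()
    for row in a_rows:
        rec = dict(zip(a_attrs, row))
        for tail in index.get(tuple(rec[s] for s in shared), []):
            out.add(tuple(row) + tail)
    return (tuple(a_attrs) + tuple(extra), out)

def project(rel, attrs):
    """Projection of a relation onto attrs (duplicates removed)."""
    r_attrs, rows = rel
    idx = [r_attrs.index(x) for x in attrs]
    return (tuple(attrs), {tuple(row[i] for i in idx) for row in rows})

def join_all(rels):
    return reduce(natural_join, rels)

def heavy(rels, X, output):
    """Marginalize on X: one reduced problem J_x per value x, then union.
    Assumes X occurs in at least one relation."""
    with_x = [r for r in rels if X in r[0]]
    without_x = [r for r in rels if X not in r[0]]
    # vals(X): values of X present in every relation that mentions X.
    vals = set.intersection(*[{v for (v,) in project(r, (X,))[1]} for r in with_x])
    rest = tuple(a for a in output if a != X)
    result = set()
    for x in vals:
        reduced = []
        for attrs, rows in with_x:
            i = attrs.index(X)
            keep = tuple(a for a in attrs if a != X)
            selected = (attrs, {row for row in rows if row[i] == x})
            reduced.append(project(selected, keep))
        j_x = project(join_all(without_x + reduced), rest)
        for row in j_x[1]:
            rec = dict(zip(rest, row))
            if X in output:
                rec[X] = x          # re-attach the marginalized value
            result.add(tuple(rec[a] for a in output))
    return (tuple(output), result)

# Hypothetical triangle instance R1(X,Y), R2(Y,Z), R3(Z,X).
rels = [
    (("X", "Y"), {(1, 2), (1, 3), (4, 2)}),
    (("Y", "Z"), {(2, 5), (3, 5), (2, 6)}),
    (("Z", "X"), {(5, 1), (6, 4), (6, 1)}),
]
via_heavy = heavy(rels, "X", ("X", "Y", "Z"))
direct = project(join_all(rels), ("X", "Y", "Z"))
```

Both routes return the same set of triangles. The point of the transform is the cost accounting: the loop runs once per value in `vals`, and each reduced relation holds only the tuples matching that value.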

Light:

Input: $G$, an attribute set $X$

Procedure: The light transform reduces the number of relations in $G$. Define $\mathcal{R}_X =
\{R \in \mathcal{R}(c) \mid \mathrm{attr}(R) \subseteq X\}$. We compute $R_X = \Join_{R \in \mathcal{R}(c)} \pi_X R$. This subjoin is computed
using a sequential version of the parallel technique in Section 5. Hence it takes time equal to
the DBP bound on that join. Then we delete relations in $\mathcal{R}_X$ from $G$, and add $R_X$ into $\mathcal{R}_G$.
The degrees for attributes in $R_X$ can be computed in terms of degrees in the relations from
$\mathcal{R}_X$. As long as $|\mathcal{R}_X| > 1$, this gives us a reduced problem $G'$. $O$ stays unchanged for the
reduced problem. The size of relation $R_X$ can be upper bounded using the DBP bound as
well. Let $\mathrm{DBP}(G, X)$ denote this bound.

Bound: $Q(G) \le \mathrm{DBP}(G, X) + Q(G')$, $P(G) \le \mathrm{DBP}(G, X) + P(G')$


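Split:

Input: $G$, an articulation set $S$ of attributes (see Appendix A.4) such that there are joins $G_1$, $G_2$ whose
attribute sets have no attribute outside $S$ in common, and $\mathcal{R}_G \subseteq \mathcal{R}_{G_1} \cup \mathcal{R}_{G_2}$. Also, $S$ satisfies
either (i) $S \subseteq O$, or (ii) $O \subseteq \bigcup_{R \in \mathcal{R}_{G_2}} \mathrm{attr}(R)$.

Procedure: We compute $R_S = \pi_S(\Join_{R \in \mathcal{R}_{G_1}} R)$. This takes time $P(G_1')$, where $G_1'$ is like
$G_1$ but with $O_{G_1'} = S$. Let $J_2 = (\Join_{R \in \mathcal{R}_{G_2}} R) \Join R_S$. If $O \subseteq \bigcup_{R \in \mathcal{R}_{G_2}} \mathrm{attr}(R)$, then we
compute and output $\pi_O J_2$, and we are done. This step costs $P(G_2)$. Otherwise, $S \subseteq O$. We
compute $O_2 = \pi_O J_2$. Each tuple in $O_2$ has a matching output tuple for $G$. Then we set
$R_S = R_S \cap \pi_S O_2$ and compute $O_1 = \pi_O(\Join_{R \in \mathcal{R}_{G_1}} R \Join R_S)$. Then for each tuple $t \in R_S$, we
take each pair of matching tuples $t_1 \in O_1$, $t_2 \in O_2$ and output $t_1 \Join t_2$. Let $G_1''$ be like $G_1$,
but with $O_{G_1''} = O \cap \left(\bigcup_{R \in \mathcal{R}_{G_1}} \mathrm{attr}(R)\right)$, and $G_2''$ be defined similarly. This gives us:
Bound: If $S \subseteq O$, then $Q(G) \le P(G_1') + Q(G_1'') + Q(G_2'')$.

If $O \subseteq \bigcup_{R \in \mathcal{R}_{G_2}} \mathrm{attr}(R)$, then $P(G) \le P(G_1') + P(G_2)$.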


4.2.2 Combining the Transforms 

Once we know the transforms, the DARTS algorithm is quite straightforward. It considers
all possible sequences of transforms that can be used to solve the problem, and picks the one
that gives the smallest upper bound on $Q(G)$. The number of such transform sequences is
exponential in the number of attributes and relations, but constant with respect to data size.
The $P$ and $Q$ values of various $G$s can be computed recursively given a degree configuration.
The $G'$ obtained in each recursive step itself specifies a degree configuration, over a smaller
problem. The degrees in $G'$ can be computed in terms of degrees in $G$. Note that in some
cases, we do not have cost bounds available, e.g. we do not have a $P$ bound for the Split
transform when $S \subseteq O$. This is a part of the DARTS algorithm: DARTS only considers
performing a transform when it can upper bound the resulting cost.

We show that DARTS can be used to recover existing results on sequential joins. 


► Proposition 15. If we compute the join using a single Light transform, our total cost is ≤
the AGM bound, thus recovering the result of the NPRR algorithm [16].


► Proposition 16. If we successively apply the Split transform on an α-acyclic join, with
$G_1$ being an ear of the join in each step, then the total cost of our algorithm becomes
$O(\mathrm{IN} + \mathrm{OUT})$, recovering the result of Yannakakis’ algorithm [19].


► Proposition 17. If a query has fractional hypertree width equal to fhw, then using a
combination of Split and Light transforms, we can bound the cost of running DARTS by
$O(\mathrm{IN}^{\mathrm{fhw}} + \mathrm{OUT})$, recovering the fractional hypertree width result.


► Proposition 18. A cycle join of length $n$ with all relations having size $N$ can be processed
by DARTS in time $O(N^{2 - \frac{1}{1 + \lceil n/2 \rceil}} + \mathrm{OUT})$, recovering the result of the AYZ algorithm [4].

These propositions are proved in Appendix C. In the next subsection, we present a few of
the cases in which we can go beyond existing results. Since we are primarily interested in
joins, the output attribute set $O$ below is always assumed to be $\mathcal{A}$.

4.3 Subquadratic Joins

Now we consider applications of DARTS to the subquadratic joins problem. Analyzing a
run of DARTS on a join graph allows us to obtain a subquadratic runtime upper bound in
several cases. Appendix D.1 mentions a simple extension of the AYZ result to graphs that
are trees with cycles embedded in them. We now define a set of graphs for which we have a
complete decision procedure to determine if they can be solved in subquadratic time modulo
the 3-SUM problem.

1-series-parallel graphs 

► Definition 19. A 1-series-parallel graph is one that consists of:

• A source node $X_S$

• A sink node $X_T$

• Any number of paths, of arbitrary length, from $X_S$ to $X_T$, having no other nodes in
common with each other

Equivalently, a 1-series-parallel graph is a series parallel graph that can be obtained using any
number of series transforms (which create paths) followed by exactly one parallel transform,
which joins the paths at the endpoints. A cycle is a special case of a 1-series-parallel graph.

► Theorem 20. For 1-series-parallel graphs, the following decision procedure determines
whether or not the join over that graph can be processed in subquadratic time:

1. If there is a direct edge (path of length one) between $X_S$ and $X_T$, then the join can be
processed in subquadratic time. Else:

2. Remove all paths of length two between $X_S$ and $X_T$, as they do not affect the subquadratic
solvability of the join problem. Then:

3. If the remaining number of paths (all of which have length ≥ 3) is ≥ 3, then the join
cannot be processed in subquadratic time (modulo 3-SUM). If the number of remaining
paths is < 3, then the graph can be solved in subquadratic time.
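The three steps of Theorem 20 translate directly into a short predicate. In this sketch the graph is described only by the lengths (edge counts) of its source-to-sink paths, the function name `subquadratic_solvable` is hypothetical, and the thresholds are read as "at least 3 remaining paths, each of length at least 3", which is an interpretation of the statement above.

```python
def subquadratic_solvable(path_lengths):
    """Decision procedure sketch for a 1-series-parallel graph.

    path_lengths: number of edges on each path from the source to the sink.
    Returns True if the join is subquadratically solvable according to the
    three-step procedure, False if it is not (modulo 3-SUM).
    """
    if any(length < 1 for length in path_lengths):
        raise ValueError("every path needs at least one edge")
    # Step 1: a direct edge between source and sink.
    if 1 in path_lengths:
        return True
    # Step 2: paths of length two do not affect the answer.
    remaining = [length for length in path_lengths if length != 2]
    # Step 3: fewer than three long paths means subquadratic.
    return len(remaining) < 3
```

For instance, a cycle of length 6 is two paths of length 3 and is reported as solvable, matching the AYZ result, while three paths of length 3 are reported as not solvable.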

Theorem 20 establishes the decision procedure for subquadratic solvability of 1-series-parallel
graphs. Appendix D.4 gives an example of a subquadratic solution for a specific
1-series-parallel graph, namely $K_{2,n}$, followed by an example on the general bipartite graph
$K_{m,n}$. In both these examples, DARTS achieves a better runtime exponent than previously
known algorithms. We now make three statements that together imply Theorem 20. They
are formally stated and proved in Appendix D (Lemmas 33, 34, 35).


1. If we have a 1-series-parallel graph which has a direct edge from $X_S$ to $X_T$ (i.e. a path
of length 1), then a join on that graph can be processed in subquadratic time.
2. Suppose we have a 1-series-parallel graph $G$, which does not have a direct edge from $X_S$
to $X_T$, but has a vertex $X_U$ such that there is an edge from $X_S$ to $X_U$ and from $X_U$ to
$X_T$ (i.e. a path of length 2 from $X_S$ to $X_T$). Let $G'$ be the graph obtained by deleting
the vertex $X_U$ and edges $X_S X_U$ and $X_U X_T$. Then the join on $G$ can be processed in
subquadratic time if and only if that on $G'$ can be processed in subquadratic time.
3. Let $G$ be any 1-series-parallel graph which does not have an edge from $X_S$ to $X_T$, but has
≥ 3 paths of length at least 3 each, from $X_S$ to $X_T$. Then a join over $G$ can be processed in
subquadratic time only if the 3-SUM problem can be solved in subquadratic time.

4.4 A new notion of width (m-width)

We demonstrate a way to formulate the DARTS algorithm for joins (without projection) in 
terms of GHDs. 

For each $A \subseteq \mathcal{A}$, we define $m_A$ similarly to how we defined $m_{\mathcal{A}}$ in Section 3.3. Specifically,
for each $A$, we use the same constraints as in Linear Program 2 but the objective is set
to Maximize $s_A$ instead of Maximize $s_{\mathcal{A}}$. $m_A$ is then defined as the value of this objective
function. We let $\mathrm{Prog}(A)$ denote the above linear program for finding $m_A$. Then the size
$|\pi_A(\Join_{R \in \mathcal{R}} R)|$ must be bounded by $\mathrm{IN}^{m_A}$ for all $A \subseteq \mathcal{A}$ (see Appendix Proposition 39).
Moreover, for any GHD $D = (T, \chi)$ of query $\mathcal{R}$, we can define $\mathrm{MW}(D, \mathcal{R})$ to be $\max_{t \in T}(m_{\chi(t)})$.
And $\mathrm{MW}(\mathcal{R})$ is simply the minimum value of $\mathrm{MW}(D, \mathcal{R})$ over all GHDs $D$. Thus we have:

► Definition 21. The m-width of a join query $\mathcal{R}$ (possibly with non-uniform degrees)
is given by $\max_{c \in C_2} \mathrm{MW}(\mathcal{R}(c))$.

► Theorem 22. A query with m-width MW can be answered in time $O(\mathrm{IN}^{\mathrm{MW}} + \mathrm{OUT})$.


This theorem lets us recover all our subquadratic joins results as well. That is, for the
1-series-parallel graphs that have a subquadratic join algorithm (as per Theorem 20), we can
construct a GHD that has m-width less than 2 (see Appendix E.4).

We can show the MO bound to be better than the DBP bound (and consequently, the
AGM bound, as stated in Theorem 9 earlier).


► Theorem 23. For any join query $\mathcal{R}$, and any degree configuration $c \in C_2$, $\mathrm{MO}(\mathcal{R}(c)) \le
\mathrm{DBP}(\mathcal{R}(c), 2) + |C| \log(2)$, where $C$ is the cover used in the DBP bound.


Note that since logarithms are to the base IN, the $|C| \log(2)$ term is negligible even though
it goes in the exponent of the bound, i.e. its exponent is a constant. Theorems 22 and 23 let
us recover all the results of the DARTS algorithm (see Appendix E.4).

The theorems also imply that our new notion of width (m-width) is tighter than fhw.
Appendix E.5 compares m-width to submodular width (which, barring m-width, is the
tightest known notion of width applicable to general joins). Appendix E.5 shows examples
where m-width is tighter than submodular width, but we do not know in general if m-width
is tighter than submodular width.

Appendix E.6 shows that while m-width < 2 implies subquadratic solvability, the converse
is not true; we show an example join which has m-width and submodular width = 2 but
can be solved in subquadratic time. Thus known notions of width do not fully characterize
subquadratically solvable graphs.


5 Parallel Join Processing 

Like in sequential settings, degree-uniformization can be applied in a MapReduce setting. 
We first present the DBP bound, which is a bound on output size that is tighter than the AGM
bound (but not tighter than MO), and characterizes the complexity of our parallel algorithm.
Then we present a 3-round MapReduce algorithm whose cost equals the DBP bound at the 
highest level of parallelism. 


The DBP Bound 

We start by defining a quantity called the Degree-based packing (DBP).

► Definition 24. Let $\mathcal{R}$ be a set of relations, with attributes in $\mathcal{A}$. Let $C$ denote a cover, i.e.
a set of pairs $(R, A)$ such that $R \in \mathcal{R}$, $A \subseteq \mathrm{attr}(R)$, and $\bigcup_{(R,A) \in C} A = \mathcal{A}$. Let $L > 1$. Then,
consider the following linear program for $C, L$.

► Linear Program 3.

Minimize $\sum_{a \in \mathcal{A}} v_a$ such that $\forall (R, A) \in C, \forall A' \subseteq A$: $\sum_{a \in A'} v_a \ge \log\left(\frac{d_{\pi_A(R),\, A \setminus A'}}{L}\right)$

If $O_{C,L}$ is the optimal objective value of the above program, then we define $\mathrm{DBP}(\mathcal{R}, L)$ to
be $\min_C O_{C,L}$, where the minimum is taken over all covers $C$.

► Proposition 25. Let $L > 1$ be a constant. Then the output size of $\Join_{R \in \mathcal{R}} R$ is in
$O(\mathrm{IN}^{\mathrm{DBP}(\mathcal{R}, L)})$.


We implicitly prove this result by providing a parallel algorithm whose complexity equals the
output size bound at the maximum parallelism level. We can now define the DBP bound.
We arbitrarily set $L = 2$ for this definition (choosing another constant value only changes
the bound by a constant factor). Thus, we define the DBP bound to be $\sum_{c \in C_2} \mathrm{IN}^{\mathrm{DBP}(\mathcal{R}(c), 2)}$.
As a simple corollary, the output size of the join is ≤ the DBP bound.

► Theorem 26. For each degree configuration $c \in C_L$, $\mathrm{IN}^{\mathrm{DBP}(\mathcal{R}(c), L)} \le \mathrm{AGM}(\mathcal{R}(c))$.

We prove this theorem using a sequence of linear program transformations, starting with
the AGM bound, and ending with the DBP bound, with each transformation decreasing
the objective function value. The key transform is the fifth one, where we switch from a
cover-based program to a packing-based program. The proof itself is long and is deferred to
Appendix F.2. Appendix F.3 contains a simple triangle-join example where the DBP bound
has a tighter exponent than the AGM bound, and another more general example showing
that the DBP bound has a strictly better exponent than AGM for ‘almost all’ degrees.


Parallel Join Algorithm 


We present our parallel 3-round join algorithm. The algorithm works at all levels of parallelism
specified by load level $L$. Its communication cost matches the DBP bound when $L = O(1)$.
We formally state the result, and then provide an example of its performance (with additional
examples provided in Appendix F.5).


► Theorem 27. For any value of $L$, we can process a join in $O(\log_L(\mathrm{IN}))$ rounds (three
rounds if degrees are already known) with load $O(L)$ per processor and a communication cost
of $O(\mathrm{IN} + \mathrm{OUT} + \max_{c \in C_L} L \cdot \mathrm{IN}^{\mathrm{DBP}(\mathcal{R}(c), L)})$.

Proof. (Sketch)

The join consists of the following steps:

1. Perform degree finding and uniformization using bucket range $L$, as shown in Section 3.2.

2. For each degree configuration, re-compute the degrees, and use them to solve Linear
Program 3 for each cover. Let $C$ be the cover that gives the smallest objective value.
This smallest value will equal $\mathrm{DBP}(\mathcal{R}(c), L)$.

3. MapReduce round 1: Join all the $\pi_A(R) : (R, A) \in C$ in a single step using the shares
algorithm. Each attribute $a$ is assigned share $\mathrm{IN}^{v_a}$, where $v_a$ is from our solution to
Linear Program 3. This ensures a load of $O(L)$ per processor, and communication cost of
$O(\max_{c \in C_L} L \cdot \mathrm{IN}^{\mathrm{DBP}(\mathcal{R}(c), L)})$ (Lemma 28, proved in Appendix).

4. MapReduce rounds 2-3: For each $R$ such that $(R, \mathrm{attr}(R)) \notin C$, semijoin it with the
output of the previous join. The semijoins for all such $R$s can be done in parallel in one
round, followed by intersection of the semijoin results in the next round. This can be
done with $O(1)$ load and communication cost of $O(\mathrm{IN} + \mathrm{OUT})$.

► Lemma 28. The shares algorithm, where each attribute $a$ has share $\mathrm{IN}^{v_a}$, where $v_a$ is from
the solution to Linear Program 3, has a load of $O(L)$ per processor with high probability, and
a communication cost of $O(\max_{c \in C_L} L \cdot \mathrm{IN}^{\mathrm{DBP}(\mathcal{R}(c), L)})$.


◄ 

► Example 29. Consider the sparse triangle join, with $\mathcal{R} = \{R_1(X, Y), R_2(Y, Z), R_3(Z, X)\}$.
Each relation has size $N$, and each value has degree $O(1)$. When the load level is $L < N$,
the join requires $\mathrm{IN}^{\mathrm{DBP}(\mathcal{R}, L)} = O(\frac{N}{L})$ processors. Equivalently, when we have $p$ processors, the
load per processor is $\frac{N}{p}$, which means it decreases as fast as possible as a function of $p$.

In contrast the vanilla shares algorithm allocates a share of $p^{1/3}$ to each attribute, and the
load per processor is $N p^{-2/3}$. Current state of the art work [8] has a load of $N p^{-2/3}$ as well.

We further explore and generalize this example in Appendix F.5. We also show an
example where our parallel algorithm operating at maximum parallelism still has lower total
cost than existing state-of-the-art sequential algorithms.

Conclusion and Future Work


We demonstrated that using degree information for a join can let us tighten the exponent of 
our output size bound. We presented a parallel algorithm that works at all levels of parallelism, 
and whose communication cost matches a tightened bound at the maximum parallelism level. 
We proposed the question of deciding which joins can be processed in subquadratic time, 
and made some progress towards answering it. We showed a tight quadratic lower bound for 
a family of joins, making it the only known tight bound that makes no assumptions about 
the matrix multiplication exponent. We presented an improved sequential algorithm, namely 
DARTS, that generalizes several known join algorithms, while outperforming them in several 
cases. We recovered the results of DARTS in the GHD framework, using a novel notion of 
width that is tighter than fhw and sometimes tighter than submodular width as well. 

We presented several cases in which DARTS outperforms existing algorithms, in the 
context of subquadratic joins. However, it is likely that DARTS outperforms existing 
algorithms on joins having higher treewidths as well. A fuller exploration of the improved 
upper bounds achieved by DARTS is left to future work. Appendix |E.6| shows an example 
where a join can be performed in subquadratic time despite its m-width/submodular width 
being = 2. Thus the problem of precisely characterizing which joins can be performed in 
subquadratic time remains open. Moreover, we focused entirely on using degree information 
for join processing; using other kinds of information stored by databases to improve join 
processing is a promising direction for future work.

A Background

A.1 Generalized Hypertree Decompositions (GHDs)

► Definition 30. Given a set of relations $\mathcal{R}$ over attributes $\mathcal{A}$, a generalized hypertree
decomposition is a pair $(T, \chi)$ where $T$ is a tree and $\chi$ is a function from nodes of $T$ to $2^{\mathcal{A}}$
such that

• For each relation $R \in \mathcal{R}$, there exists a tree node in $T$ that covers the relation, i.e.
$\mathrm{attr}(R) \subseteq \chi(t)$.

• For each attribute $A \in \mathcal{A}$, the set of tree nodes containing $A$, i.e. $\{t \mid A \in \chi(t)\}$, forms a
connected subtree.


The latter condition is called the “running intersection property”. The $\chi(t)$ sets are
referred to as ‘bags’ of the GHD. Using GHDs, we can define several notions of ‘width’, which
capture the cyclicity of a query. For example, the treewidth of a GHD is the maximum value of
$|\chi(t)| - 1$ over nodes $t$ in $T$, and treewidth of a query is the treewidth of its minimum-treewidth
GHD. Similarly, fractional hypertreewidth (fhw) is the maximum value of $\log_{\mathrm{IN}}(\mathrm{AGM}(\chi(t)))$
over $t \in T$ where $\mathrm{AGM}(\chi(t))$ is the AGM bound over the set of attributes in $\chi(t)$ for the
given relations $\mathcal{R}$. Again the fhw of a query is the minimum fhw over its GHDs.

If the width of a GHD is $w$ (for any of the known notions of width), then the size of
the join $\Join_{R \in \mathcal{R}} \pi_{\chi(t)}(R)$ is $\le \mathrm{IN}^w$ for all $t \in T$. Thus the join can be computed by first
computing the join within each bag as above, and then running Yannakakis’ algorithm [19] on
the resulting relations with a runtime of $\mathrm{IN}^w + \mathrm{OUT}$.


A.2 MapReduce 

In the MapReduce (MR) model, there are unboundedly many processors on a networked 
file system. Each processor has unbounded hard disk space and load capacity L (explained 
later). The computation proceeds in two phases. 

Step 1 : Each processor (referred to as a mapper), reads its tuples from its hard disk and 
sends each tuple to one or more processors (called reducers). The total number of tuples 
received by each reducer from all mappers should not exceed load capacity L. 

Step 2 : Each reducer locally processes the ≤ L tuples it receives, and streams its output
to the network file system. The output size at a reducer can exceed load capacity L as it is 
streamed to the network file system. 

The communication cost of each round is defined as the total number of tuples sent 
from all mappers to reducers. We measure the complexity of our algorithms in terms of 
communication cost and number of rounds. 


A.3 The Shares Algorithm 

Shares is a one-round MapReduce algorithm. Shares is a parameterized algorithm, whose
communication cost is different for different queries and machine sizes. Suppose we have a
join $\Join_{R \in \mathcal{R}} R$ with attribute set $\mathcal{A}$. Shares assigns a parameter $S_A$, called a ‘share’, to each
attribute $A \in \mathcal{A}$. It hashes each attribute $A$ into $S_A$ buckets using a hash function $h_A$. It
uses $\prod_{A \in \mathcal{A}} S_A$ processors, with one processor corresponding to each tuple of hash values. For
any processor $P$ and attribute $A$, we use $P(A)$ to denote the hash value of $A$ corresponding
to processor $P$.

Shares uses a single round of MapReduce. In that round, each tuple $t \in R$, $R \in \mathcal{R}$, is sent
to every processor $P$ such that $P(A) = h_A(t(A))\ \forall A \in \mathrm{attr}(R)$. Then, each processor joins
all the tuples it receives, and the final output of the join equals the union of the outputs
produced by all processors.

Each tuple in relation $R$ gets sent to $\prod_{A \notin \mathrm{attr}(R)} S_A$ processors. The communication cost
of this algorithm is thus $\sum_{R \in \mathcal{R}} |R| \prod_{A \notin \mathrm{attr}(R)} S_A$. The expected ‘load’ on each processor
(number of input tuples it receives) is $\sum_{R \in \mathcal{R}} |R| \prod_{A \in \mathrm{attr}(R)} (S_A)^{-1}$, which is simply the total
communication divided by the total number of processors. On the other hand, the variance
in load can be high, leading to some processors receiving a very high number of input tuples.
In general, the shares $S_A$ are chosen so as to minimize the total communication cost, given
the number of processors.
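The routing rule of Shares can be simulated in memory for a small triangle join. This is a sketch under stated assumptions: the names `shares_route` and `local_triangles` are hypothetical, attribute values are integers, and `value % share` stands in for the hash functions $h_A$. It is meant to show where each tuple goes, not to model a real MapReduce runtime.

```python
from itertools import product

def shares_route(relations, shares):
    """Send each tuple to every processor that agrees with it on the hashed
    values of the attributes the tuple contains. Returns the per-processor
    inboxes and the total number of tuples sent (the communication cost)."""
    attrs = list(shares)
    inbox = {p: {name: set() for name in relations}
             for p in product(*[range(shares[a]) for a in attrs])}
    sent = 0
    for name, (schema, rows) in relations.items():
        for row in rows:
            rec = dict(zip(schema, row))
            # One fixed coordinate per attribute in the tuple, every bucket otherwise.
            choices = [[rec[a] % shares[a]] if a in rec else range(shares[a])
                       for a in attrs]
            for p in product(*choices):
                inbox[p][name].add(row)
                sent += 1
    return inbox, sent

def local_triangles(box):
    """Join R1(X,Y), R2(Y,Z), R3(Z,X) on the tuples held by one processor."""
    return {(x, y, z)
            for (x, y) in box["R1"]
            for (y2, z) in box["R2"]
            if y2 == y and (z, x) in box["R3"]}

# Hypothetical input and shares; 2 * 2 * 2 = 8 simulated processors.
relations = {
    "R1": (("X", "Y"), {(1, 2), (1, 3), (4, 2)}),
    "R2": (("Y", "Z"), {(2, 5), (3, 5), (2, 6)}),
    "R3": (("Z", "X"), {(5, 1), (6, 4), (6, 1)}),
}
shares = {"X": 2, "Y": 2, "Z": 2}
inbox, sent = shares_route(relations, shares)
triangles = set().union(*(local_triangles(box) for box in inbox.values()))
```

Each binary tuple is replicated along the one attribute it lacks, so `sent` is $3 \cdot 3 \cdot 2$ here, in line with the communication formula above, and the union of the local outputs is the full join.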

A.4 Articulation set 

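Suppose we have a hypergraph $\mathcal{H} = (V, E)$ with $E \subseteq 2^V$. The hypergraph is connected if
for each pair $v_1, v_2 \in V$, there exists a sequence $u_0, u_1, \ldots, u_k$ such that $u_i \in V\ \forall\, 0 \le i \le k$,
$u_0 = v_1$, $u_k = v_2$ and $\forall i < k\ \exists e \in E : u_i \in e \wedge u_{i+1} \in e$.

If $\mathcal{H}$ is connected, then an articulation set $S$ is a set $S \subseteq V$ such that the hypergraph
$\mathcal{H} \setminus S = (V \setminus S, \{e \setminus S \mid e \in E\})$ is not connected. Equivalently, $S$ is an articulation set if
there exists a non-empty $V' \subsetneq V \setminus S$ such that $\forall e \in E$, either $e \subseteq V' \cup S$ or $e \subseteq V \setminus V'$.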
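The first form of the definition can be checked mechanically: remove $S$ from every hyperedge and test connectivity of what is left. The sketch below uses hypothetical helper names (`is_connected`, `is_articulation_set`) and treats an empty vertex set as connected, which is an assumption about a corner case the text does not address.

```python
def is_connected(vertices, edges):
    """Hypergraph connectivity: two vertices are adjacent when some hyperedge
    contains both; connectivity is reachability under that adjacency."""
    vertices = set(vertices)
    if not vertices:
        return True
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for e in edges:
            if u in e:
                for w in e:
                    if w in vertices and w not in seen:
                        seen.add(w)
                        stack.append(w)
    return seen == vertices

def is_articulation_set(vertices, edges, S):
    """True if the hypergraph is connected and removing S disconnects it."""
    S = set(S)
    rest = set(vertices) - S
    reduced = [set(e) - S for e in edges]
    return is_connected(vertices, edges) and not is_connected(rest, reduced)

# Hypothetical examples: a chain query and a 4-cycle query.
chain = [{"X1", "X2"}, {"X2", "X3"}, {"X3", "X4"}]
cycle = [{1, 2}, {2, 3}, {3, 4}, {4, 1}]
```

On the chain, `{"X2"}` separates `X1` from the rest. On the 4-cycle no single attribute is an articulation set, but the opposite pair `{1, 3}` is.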

A.5 The AYZ algorithm 

Consider a join given by $R_1(X_1, X_2), R_2(X_2, X_3), \ldots, R_n(X_n, X_1)$, for $n \ge 4$. This is the
cycle join of length $n$. The cycle has fhw equal to 2, so the join can be processed in time
$O(N^2 + \mathrm{OUT})$. However, we can even process the join in subquadratic time as follows: For
each attribute $X_j$, we compute the degree of each of its values. We choose a threshold $\Delta$.
We call any value with degree less than $\Delta$ light, and other values heavy. We process heavy
and light values separately. The number of heavy values in an attribute can be at most $\frac{N}{\Delta}$.
For each heavy value $h$ in each attribute $X_j$, we ‘marginalize’ over the value, i.e. restrict $X_j$
to $h$. So effectively we compute the join with all values in $X_j$ other than $h$ removed. This
effectively turns the join into a chain join

$\pi_{X_{j+1}}(\sigma_{X_j = h} R_j) \Join R_{j+1} \Join R_{j+2} \Join \cdots \Join R_n \Join R_1 \Join R_2 \Join \cdots \Join R_{j-2} \Join \pi_{X_{j-1}}(\sigma_{X_j = h} R_{j-1})$

Adding column $X_j = h$ to the output of the chain above gives us the output for $h$. Let us
call this output $\mathrm{OUT}_h$. Using Yannakakis’ algorithm on the chain lets us solve it in time
$O(N + \mathrm{OUT}_h)$. Thus, the total time for processing all heavy values in all attributes is

$\sum_h O(N + \mathrm{OUT}_h) = \sum_h O(N) + \sum_h O(\mathrm{OUT}_h) = O\left(\frac{nN}{\Delta} N\right) + O(\mathrm{OUT}) = O\left(\frac{N^2}{\Delta} + \mathrm{OUT}\right)$

This way, we can find all outputs containing at least one heavy value. After this is done,
we can delete all the heavy values, and process only light values. This is done by a simple
brute force search. We start with each value in $X_1$, which has at most $\Delta$ neighbors in
$X_2, X_n$, which together have at most $\Delta^2$ neighbors in $X_3, X_{n-1}$ and so on. At the attribute
where the two directions meet, we take the intersection of neighbors from both directions.
The total running time for this procedure is the number of values in $X_1$, i.e. $N$, times the
total number of neighbors explored per $X_1$ value, which is $\Delta^{\lceil n/2 \rceil}$. Thus, the total
processing time of the join is

$O\left(\frac{N^2}{\Delta} + N \Delta^{\lceil n/2 \rceil} + \mathrm{OUT}\right)$

Setting $\Delta = N^{\frac{1}{1 + \lceil n/2 \rceil}}$ gives us the minimum value of the running time, which is also
subquadratic.
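The choice of threshold can be double-checked with exact arithmetic on exponents of $N$. The helper below is a hypothetical illustration (the name `ayz_exponents` is not from the source): writing $\Delta = N^{\delta}$, the heavy phase costs $N^{2 - \delta}$ and the light phase costs $N^{1 + \lceil n/2 \rceil \delta}$, and the two meet at $\delta = 1 / (1 + \lceil n/2 \rceil)$.

```python
from fractions import Fraction

def ayz_exponents(n):
    """Return (delta, heavy_exponent, light_exponent) for a cycle of length n,
    where the threshold is N**delta and both costs are expressed as powers of N."""
    k = -(-n // 2)                  # ceil(n / 2) using integer arithmetic
    delta = Fraction(1, 1 + k)      # balancing choice of the threshold exponent
    heavy = 2 - delta               # from N^2 / Delta
    light = 1 + k * delta           # from N * Delta^ceil(n/2)
    return delta, heavy, light
```

For `n = 4` this gives exponent 5/3 for both phases, and for `n = 6` it gives 7/4. In every case the common exponent is strictly below 2.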

A.6 3-SUM 

We first define the 3-SUM problem below. 

► Problem 2. The 3SUM problem: Given $n$ integers $x_1, x_2, \ldots, x_n$, all polynomial sized in
$n$, do there exist three of those numbers, $x_i, x_j, x_k$, such that $x_i + x_j + x_k = 0$?

There is no known algorithm for solving this problem in time $O(n^{2-\epsilon})$ for any $\epsilon > 0$, and
it is believed that such an algorithm does not exist. On the other hand, there is a known
algorithm for solving the problem in time that is smaller than $n^2$ by a subpolynomial (log)
factor. We next state the 3-XOR problem, which is subquadratically reducible from the
3-SUM problem.

► Problem 3. The 3XOR problem: Given $n$ integers $x_1, x_2, \ldots, x_n$, all polynomial sized
in $n$, do there exist three of those numbers, $x_i, x_j, x_k$, such that $x_i \oplus x_j \oplus x_k = 0$, where $\oplus$
refers to bitwise xor?
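For reference, the quadratic baseline that the 3-SUM conjecture says cannot be improved polynomially is the textbook sort-and-two-pointers scan. The sketch below is that standard method, not something taken from this paper, and it assumes the three numbers must come from distinct positions in the input.

```python
def three_sum(xs):
    """Return True if three entries at distinct positions sum to zero.

    Sorting costs O(n log n); the nested scan is O(n^2) overall because the
    two pointers together move at most n steps for each fixed first element.
    """
    xs = sorted(xs)
    n = len(xs)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            s = xs[i] + xs[lo] + xs[hi]
            if s == 0:
                return True
            if s < 0:
                lo += 1     # need a larger sum
            else:
                hi -= 1     # need a smaller sum
    return False
```

A hash-based variant handles 3XOR in the same quadratic time by looking up `x_i ^ x_j` among the remaining entries.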


B Degree Computation 

► Lemma. Given a relation R and a A C att r(R), and a L > 1, we can find deg(v, f?, A) for 
each v £ tta(R) in a MapReduce setting, using 0(|i?|) total communication, in 0(log i (|f?|)) 
MapReduce rounds, and with O(L) load per processor. In a sequential setting, we can 
compute degrees in time 0(|i?|). 

Proof. Suppose the schema of R has K attributes X\, X 2 , ■ ■ ■, Xk- Let \A\ = K' < K. 
Without loss of generality, we can assume that A = [X-], X 2 , ... ,Xk'}- We want to find 
the degree of each value in nx 1 ,x 2 ,...,x K ,{R)- We make no assumption about the starting 
location of different tuples of R, each tuple of R could be in a different processor. 

We have \R\ x ^ processors, indexed by (ki,k 2 ), with 1 < Aq,A; 2 < |i?|. For each 
tuple (x\, £ 2 ,..., Xk) G R, its processor finds a hash Aq £ {1,2,..., |i?|} of (£ 1 , £ 2 , •.., Xk 1 )- 
In addition, the processor generates a random number k 2 £ |l,2,...,^|, and sends 
(£ 1 , £ 2 , • • ■, Xk>) to processor {k\,k 2 ). Each processor receives at most 0(L ) tuples in 
expectation, because of the second random hash. The first index of the processors (fci) 
corresponds to the tuple value. Because we have \R\ buckets for the first index, each 
hash value Aq should correspond to 0(1) distinct values of (£i,£ 2 , ... ,£*-')■ Each tuple is 
associated with a ‘count’ field. The initial value of the count field, when the tuple is sent to 
any processor in the starting step, is 1. 

The next $\log_L(|R|)$ steps are as follows: In each step, each processor $(k_1, k_2)$ locally
aggregates the count of each of its tuples (since a processor may have received multiple copies
of the same $(x_1, x_2, \ldots, x_{K'})$ value from different processors), and sends each aggregated
tuple-count pair to processor $(k_1, \lceil k_2 / L \rceil)$. Thus, in $\log_L(|R|)$ steps, we will only have tuples
in processors with $k_2 = 1$. Each processor $(k_1, 1)$ should contain $O(1)$ distinct tuples and
their counts. In each of these steps, the number of tuples received by a processor $p$ would
correspond to the number of distinct values of $(x_1, x_2, \ldots, x_{K'})$ that hash to the same value in
$\{1, 2, \ldots, |R|\}$, times $L$ (the number of processors sending tuples to $p$), which is $O(L)$ (up to log
factors as specified earlier). At this stage, for each value $(x_1, x_2, \ldots, x_{K'}) \in (X_1, X_2, \ldots, X_{K'})$,
we have its total count, which equals its degree in $R$, as needed.

In the sequential setting, we can simply have one processor simulate the MapReduce 
computation above. Its computation cost equals the sum of computation and communication 
costs of all Mappers and Reducers in all rounds. The total computation is fully subsumed by 
the total communication of the MapReduce algorithm, which is $O(|R|)$. ◄
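
The round structure of this proof can be simulated on one machine. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function name `degrees_by_rounds`, the use of Python's built-in `hash` for the first index, and the dictionary that stands in for the grid of processors are all illustrative choices. It takes the number of second-index slots to be the relation size divided by the load.

```python
import math
import random
from collections import defaultdict


def degrees_by_rounds(tuples, key_len, load):
    """Count how often each projection onto the first key_len attributes occurs.

    Processors are modelled as dictionary keys (k1, k2): k1 is a hash bucket of
    the projected value, k2 is a random slot that shrinks by `load` per round.
    """
    if load < 2:
        raise ValueError("load must be at least 2 so that the rounds terminate")
    n = len(tuples)
    slots = max(1, math.ceil(n / load))

    # Starting step: every tuple is sent with count 1 to processor (k1, k2).
    procs = defaultdict(lambda: defaultdict(int))
    for t in tuples:
        key = tuple(t[:key_len])
        k1 = hash(key) % max(1, n)
        k2 = random.randint(1, slots)
        procs[(k1, k2)][key] += 1

    rounds = 0
    # Each round, (k1, k2) forwards its aggregated counts to (k1, ceil(k2 / load)).
    while any(k2 > 1 for (_, k2) in procs):
        nxt = defaultdict(lambda: defaultdict(int))
        for (k1, k2), counts in procs.items():
            for key, c in counts.items():
                nxt[(k1, math.ceil(k2 / load))][key] += c
        procs = nxt
        rounds += 1

    # Equal keys share the same k1, so the remaining processors hold disjoint keys.
    degrees = {}
    for counts in procs.values():
        degrees.update(counts)
    return degrees, rounds
```

Calling `degrees_by_rounds(rel, 2, 4)` on a list of tuples `rel` returns the degree of every value of the first two attributes, together with the number of aggregation rounds that were simulated.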


C Recovering previous results using DARTS


C.1 Proof of Proposition 15


► Proposition. If we compute the join using a single Light transform, our total cost is $\le$
the AGM bound, thus recovering the result of the NPRR algorithm [16].


Proof. If we perform a light transform, with set $X$ equal to the set of all attributes in the
join, then $\mathrm{DBP}(G, X)$ simply equals the DBP bound on the join. Theorem 1 tells us that this
is less than the AGM bound on the join. Moreover, after the light transform, the resulting
join only has a single relation $R_X$ whose size equals the DBP bound on the join. Hence the
$P$ and $Q$ values of the original join equal its DBP bound, and are $\le$ the AGM bound. ◄


C.2 Proof of Proposition 16


► Proposition. If we successively apply the Split transform on an $\alpha$-acyclic join, with
$G_1$ being an ear of the join in each step, then the total cost of our algorithm becomes
$O(\mathrm{IN} + \mathrm{OUT})$, recovering the result of Yannakakis' algorithm [19].


Proof. We prove that the $Q$ of an $\alpha$-acyclic join is $O(\mathrm{IN})$, which implies the proposition.
We use induction on the number of relations in the join. It is clearly true when we have
only 1 relation. Suppose $Q$ equals input size for $\alpha$-acyclic joins with $\le n - 1$ relations, and
consider an $\alpha$-acyclic join with $n$ relations. Because it is $\alpha$-acyclic, it has an 'ear', i.e. it has
a relation $R_1$ and a relation $R_2$ such that each attribute of $R_1$ is either unique to it, or is an
attribute of $R_2$ as well. We apply the Split transform with $S = \mathrm{attr}(R_1) \cap \mathrm{attr}(R_2)$. Since
this is a join, $O$ consists of all attributes, hence $S \subseteq O$. This lets us use the bound:

$$Q(G) \le P(G_1') + Q(G_1'') + Q(G_2)$$

$G_1'$ has only one relation ($R_1$), so $P(G_1')$ is $O(\mathrm{IN})$. Similarly, $Q(G_1'')$ is $O(\mathrm{IN})$. Consider
$G_2$, which consists of a relation $R_S$ and the relations in the original join other than $R_1$.
The attributes of $R_S$ are a subset of the attributes of $R_1$. We do a light transform with
$X = \mathrm{attr}(R_2)$. $\mathcal{R}_X$ definitely includes $R_2$ and $R_S$. Since the attributes in $X$ are all contained
in $R_2$, the DBP bound on this join is at most the size of $R_2$, i.e. $O(\mathrm{IN})$. Moreover, the
resulting join after the light transform has at most $n - 1$ relations and is $\alpha$-acyclic. By
the inductive hypothesis, its $Q$ value is $O(\mathrm{IN})$. Thus the $Q$ value of the whole join is at most
$O(\mathrm{IN}) + O(\mathrm{IN}) + O(\mathrm{IN}) + O(\mathrm{IN}) = O(\mathrm{IN})$, which completes the proof. ◄
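
The induction above removes one ear per step. As a hedged illustration of that ear condition, the sketch below tests whether a join is $\alpha$-acyclic by repeatedly deleting ears. The function name `is_alpha_acyclic` and the encoding of relations as collections of attribute names are assumptions of this example, and the quadratic search for an ear is chosen for readability rather than speed.

```python
def is_alpha_acyclic(relations):
    """Return True if ears can be removed until at most one relation remains."""
    rels = [frozenset(r) for r in relations]
    while len(rels) > 1:
        for i, r1 in enumerate(rels):
            others = rels[:i] + rels[i + 1:]
            # Attributes of r1 that are not unique to it.
            shared = {a for a in r1 if any(a in o for o in others)}
            # r1 is an ear if a single other relation holds all shared attributes.
            if any(shared <= o for o in others):
                rels = others
                break
        else:
            return False  # no ear exists, so the join is cyclic
    return True
```

For example, a chain of relations over `AB`, `BC`, `CD` passes the test, while the triangle over `AB`, `BC`, `CA` does not.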


C.3 Proof of Proposition 17

► Proposition. If a query has fractional hypertree width equal to fhw, then using a
combination of Split and Light transforms, we can bound the cost of running DARTS by
$O(\mathrm{IN}^{\mathrm{fhw}} + \mathrm{OUT})$, recovering the fractional hypertree width result.

Proof. If fhw is the fractional hypertree width of the join, it means there exists a GHD [10, 11]
such that the highest value of the AGM bound on the bags of the GHD equals $\mathrm{IN}^{\mathrm{fhw}}$. For
each bag $B$ of the GHD, we perform a Light transform with $X$ equal to the set of attributes
in the bag. The time taken for computing $R_X$ is then the DBP bound on that join, which is
less than the AGM bound, which is $\le \mathrm{IN}^{\mathrm{fhw}}$ (by the way the GHD was chosen). After all
these light transforms, we are left with an $\alpha$-acyclic join, where each relation size is $\le \mathrm{IN}^{\mathrm{fhw}}$.
Using Proposition 16, DARTS can process this join in time $\mathrm{IN}^{\mathrm{fhw}} + \mathrm{OUT}$, proving that
DARTS recovers the fhw bound. ◄


C.4 Proof of Proposition 18

► Proposition. A cycle join of length $n$ with all relations having size $N$ can be processed
by DARTS in time $O(N^{2 - \frac{1}{1 + \lceil n/2 \rceil}} + \mathrm{OUT})$, recovering the result of the AYZ algorithm.


Proof. Let $R_i$ be the relation with schema $A_i, A_{i+1}$ ($R_n$ has schema $A_n, A_1$). Our proof
follows the AYZ algorithm described in Section A.5. Let $\Delta = N^{\frac{1}{1 + \lceil n/2 \rceil}}$. Then in degree
configurations where at least one attribute $A_j$ has degree $> \Delta$ in a relation, we perform
a heavy transform on $A_j$. The number of distinct $A_j$ values is at most $\frac{N}{\Delta}$. Thus $Q(G) \le
\frac{N}{\Delta} Q(G')$. Since $G'$ is $\alpha$-acyclic, $Q(G') \le N$. Thus $Q(G) \le \frac{N^2}{\Delta} = N^{2 - \frac{1}{1 + \lceil n/2 \rceil}}$. Now consider
degree configurations where all attributes $A_j$ have all degrees $\le \Delta$. Then we perform a
sequence of $n - 2$ light transforms. In the $(2i + 1)$th step, we perform a light transform
with $X = \{A_1, A_2, \ldots, A_{i+3}\}$. And in the $(2i + 2)$nd step, we perform a light transform
with $X = \{A_1, A_n, A_{n-1}, \ldots, A_{n-i-1}\}$. The DBP bound for the $R_X$ in the $(2i + 1)$st and
$(2i + 2)$nd transform is $\le N\Delta^{i+1}$. This can be proved inductively. For $i = 0$, set the cover
$C = \{(R_1, \{A_1, A_2\}), (R_2, \{A_2, A_3\})\}$. The solution to the linear program has $w_{R_1,\{A_1,A_2\}} =
w_{R_2,\{A_3\}} = 1$ and other values 0, which gives an output size bound of $N\Delta$. The $\{A_1, A_n, A_{n-1}\}$
case is similar. Now assume the inductive hypothesis for up to $i - 1$. For $i$, we consider
a cover $C = \{(R', \mathrm{attr}(R')), (R_{i+2}, \{A_{i+2}, A_{i+3}\})\}$ where $R'$ is the relation with schema
$\{A_1, A_2, \ldots, A_{i+2}\}$ that was obtained from the last-to-last light transform. $|R'| \le N\Delta^{i}$
by the inductive hypothesis. Then the solution to the linear program is $w_{R',\mathrm{attr}(R')} =
w_{R_{i+2},\{A_{i+2},A_{i+3}\}} = 1$, giving the required bound of $N\Delta^{i+1}$. The other case (for the $(2i + 2)$nd
light transform) is similar. At the end of these transforms, we will have two relations: $R_l$ with
schema $A_1, A_2, \ldots, A_{\lceil n/2 \rceil + 1}$ and size $\le N\Delta^{\lceil n/2 \rceil - 1}$, and $R_r$ with schema $A_1, A_n, \ldots, A_{\lceil n/2 \rceil + 1}$
and size $\le N\Delta^{\lceil n/2 \rceil - 1}$. Since these two relations now form an $\alpha$-acyclic join (any two relations
form an $\alpha$-acyclic join), we use Proposition 16 to join them in time $O(N^{2 - \frac{1}{1 + \lceil n/2 \rceil}} + \mathrm{OUT})$ as
required. ◄
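
As a concrete illustration of the heavy/light case split used in this proof, the following minimal Python sketch lists the output of a cycle join of length 3. It is not the DARTS procedure itself: the function name, the encoding of relations as lists of pairs, and the choice to split only on the first attribute are assumptions made for the example, and `delta` plays the role of the degree threshold.

```python
from collections import defaultdict


def triangles_heavy_light(r1, r2, r3, delta):
    """List (a1, a2, a3) with (a1, a2) in r1, (a2, a3) in r2, (a3, a1) in r3."""
    r1_set, r2_set, r3_set = set(r1), set(r2), set(r3)
    r1_by_a1 = defaultdict(list)
    for a1, a2 in r1_set:
        r1_by_a1[a1].append(a2)
    # An a1 value is heavy when its degree in r1 exceeds the threshold.
    heavy = {a1 for a1, a2s in r1_by_a1.items() if len(a2s) > delta}

    out = set()
    # Heavy case: fix a heavy a1, then scan r2 once (the residual join is acyclic).
    for a1 in heavy:
        for a2, a3 in r2_set:
            if (a1, a2) in r1_set and (a3, a1) in r3_set:
                out.add((a1, a2, a3))
    # Light case: extend each r3 tuple through at most delta partners in r1.
    for a3, a1 in r3_set:
        if a1 in heavy:
            continue
        for a2 in r1_by_a1.get(a1, ()):
            if (a2, a3) in r2_set:
                out.add((a1, a2, a3))
    return out
```

With relations of size $N$ there are at most $N/\texttt{delta}$ heavy values, each costing one scan of `r2`, and the light case does at most `delta` probes per tuple of `r3`, which mirrors the $N^2/\Delta$ versus $N\Delta$ trade-off in the proof.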


D Subquadratic Joins 

D.1 Tree-Cycle Structures

We mention a simple extension of the AYZ result.

► Definition 31. Tree-Cycle Structure (TCS):

1. A cycle of any length (including 1, which gives a single isolated node) is a TCS.

2. If $T_1$ and $T_2$ are two disjoint TCSs, then adding an edge from any vertex of $T_1$ to any
vertex of $T_2$ gives a new TCS.

3. All TCSs can be formed by the above two steps.

We can show that joins on TCSs can be processed in subquadratic time as well.

► Theorem 32. A join over a TCS $T = (V, E)$ can be found in time $O(N^{2 - \frac{1}{1 + \lceil n/2 \rceil}} + \mathrm{OUT})$,
where $n$ is the length of the longest cycle in the TCS.

The definition of Tree-Cycle structures can be extended to include other graphs for which 
we show subquadratic solvability. 

D.2 Subquadratic 1-series-parallel graphs 

► Lemma 33. If we have a 1-series-parallel graph, which has a direct edge from $X_S$ to $X_T$
(i.e. a path of length 1), then a join on that graph can be processed in subquadratic time.

Proof. Let $\mathcal{Z}$ be the set of paths of length $> 1$ from $X_S$ to $X_T$. We use induction on $|\mathcal{Z}|$.
If $|\mathcal{Z}| = 0$, then the join is just a single edge, which gets processed in time $O(N)$. Now
assume we have a subquadratic ($N^{2 - \epsilon_{k-1}}$) solution for $|\mathcal{Z}| = k - 1$, and let $|\mathcal{Z}| = k$. Now for
any $Z \in \mathcal{Z}$, we perform a split transform, with articulation set $S = \{X_S, X_T\}$,
and $G_1$ consisting of the attributes of $Z$. Since $G_1$ is now a cycle, its $Q$ is $\le N^{2 - \epsilon}$ for some
$\epsilon > 0$. And since $S \subseteq O$, we have

$$Q(G) \le P(G_1') + Q(G_1'') + Q(G_2)$$

$$= P(G_1') + N^{2 - \epsilon} + N^{2 - \epsilon_{k-1}}$$

So to show subquadraticness, it suffices to show that $P(G_1')$ is subquadratic. To do this,
suppose the length of path $Z$ is $n$. Let $\delta = N^{\frac{1}{n+2}}$.

• Suppose all attributes in $G_1$ have degree $\le \delta$. Then we perform a sequence of light
transforms until the join is solved, at a total cost of $N\delta^{n}$, which is subquadratic.

• Suppose the path $Z$ is given by $X_0 = X_S, X_1, \ldots, X_n = X_T$. If any attribute $X_l$ in $G_1$
has degree $> \delta$, we perform a heavy transform on it. After a heavy transform, we are left
with a chain $X_{l+1}, X_{l+2}, \ldots, X_T, X_S, X_1, \ldots, X_{l-1}$. Then we perform a split transform
with articulation set $X_{l-2}$, and $G_1$ consisting of $X_{l-2}, X_{l-1}$. Since the output attribute
set consists of $X_S, X_T$, which lies entirely in $G_2$, we use the split bound

$$P(G) \le P(G_1') + P(G_2)$$

Here, the $P(G_1')$ term is simply $N$, so this split transform effectively removes $X_{l-1}$ from
the chain. We can similarly remove the remaining attributes from the edges, leaving only $X_S$
and $X_T$, which gives a $P$ value of $N$, which is subquadratic.

This shows that the join can be processed in subquadratic time. ◄

► Lemma 34. Suppose we have a 1-series-parallel graph $G$, which does not have a direct
edge from $X_S$ to $X_T$, but which has a vertex $X_U$ such that there is an edge from $X_S$ to $X_U$
and from $X_U$ to $X_T$ (i.e. a path of length 2 from $X_S$ to $X_T$). Let $G'$ be the graph obtained
by deleting the vertex $X_U$ and edges $X_S X_U$ and $X_U X_T$. Then the join on $G$ can be processed
in subquadratic time if and only if that on $G'$ can be processed in subquadratic time.

Proof. One direction of the lemma is easy to prove: If $G'$ requires quadratic time to solve,
then by setting $X_S X_U$ and $X_U X_T$ to be full Cartesian products, we make join $G$ equivalent
to $G'$, which means it must take quadratic time.

Now assume $G'$ can be solved in time $N^{2 - \epsilon}$ for some $\epsilon > 0$. Firstly, if $X_U$ has degree
$> N^{1 - \epsilon/2}$ in either relation, then we perform a heavy transform on $X_U$, giving a total cost of
$\le N^{2 - \epsilon/2}$, which is subquadratic. So now assume the degree of $X_U$ is $\le N^{1 - \epsilon/2}$. Then perform
a light transform on $\{X_S, X_U, X_T\}$, to get a relation of size $\le N^{2 - \epsilon/2}$. Then split with $G_1$
consisting of $X_S, X_U, X_T$. This gives a relation with attributes $X_S X_T$ of size $\le N^{2 - \epsilon/2}$, to
be added to $G_2$.

Now the proof is similar to the proof for the previous lemma. We again have an edge from
$X_S$ to $X_T$, along with a number of other paths. Only this time, the edge relation has size
$\le N^{2 - \epsilon/2}$, rather than $N$. But like before, we can choose a path $Z$, and let its length be $n$.
Then we perform a split with articulation points $X_S, X_T$, and $G_1$ consisting of attributes
of $Z$. Then we are left with a $P(G_1')$, where $O_{G_1'} = \{X_S, X_T\}$. Like before, we choose a
small enough $\delta$ ($= N^{\frac{\epsilon}{2n+4}}$) such that if all attributes in $Z$ have degree $\le \delta$ in relations of size
$N$, then we perform a sequence of light transforms that give total cost $N^{2 - \epsilon/2}\delta^{n}$, which is
subquadratic.

If the attributes don't all have degree $\le \delta$ in relations of size $N$, then choose the smallest $l$
such that $X_l$ has degree $> \delta$ (where $Z$ is again written as $X_0 = X_S, X_1, \ldots, X_{n-1}, X_n = X_T$).
Suppose its degree is $d$. Then we perform light transforms for $\{X_0, X_1, X_2\}$, $\{X_0, X_1, \ldots, X_3\}$,
$\ldots$, $\{X_0, X_1, \ldots, X_{l-1}\}$, which give a total cost of $N^{2 - \epsilon/2}\delta^{l}$, getting a relation $R_l$ with
attributes $X_n, X_0, X_1, \ldots, X_l$. Let the degree of $X_l$ in $R_l$ be $d'$. Now we perform the heavy
transform on $X_l$, which has at most $\min(\frac{N}{d}, \frac{|R_l|}{d'})$ distinct values. For each value, we get a
chain, where each relation is of size $\le N$, except for $R_l$ which has size $d'$. Then using split
transforms like in the previous lemma proof, we can take out $X_{l+1}$, $X_{l+2}$ and so on one by
one, and be left with $R_l$ alone, which is projected down to $\{X_S, X_T\}$. This gives a cost of
$N + d'$ per $a \in X_l$. The total cost is thus $\min(\frac{N}{d}, \frac{|R_l|}{d'}) \times (N + d') \le \frac{N^2}{d} + |R_l|$, which is
subquadratic. This proves the lemma, as required. ◄

D.3 3-SUM Hardness Proof 

We formally state and prove the lemma for 3-SUM hardness of certain 1-series-parallel graphs. 

► Lemma 35. Let $G$ be any 1-series-parallel graph which does not have an edge from $X_S$ to
$X_T$, but has $\ge 3$ paths of length at least 3 each, from $X_S$ to $X_T$. Then a join over $G$ can be
processed in subquadratic time only if the 3-SUM problem can be solved in subquadratic time.

We will reduce the 3-XOR problem to our join problem. We only prove hardness for the
simplest 1-series-parallel graph having 3 paths of length 3 here. Joins on larger graphs can
easily be reduced to this graph. Thus, we prove the theorem below (we use slightly different
notation for the attribute names for convenience):

► Theorem. Consider a join over graph $G$ with attributes $A, B_1, C_1, B_2, C_2, B_3, C_3, D$,
and relations $R_i(A, B_i), S_i(B_i, C_i), T_i(C_i, D)$ for all $i \in \{1, 2, 3\}$, where each relation has size $N$.
Suppose for some $c > 0$, there is an algorithm that processes the join in time $O(N^{2-c} + \mathrm{OUT})$.
Then 3-SUM can be solved in time $O(N^{2-t})$ for a $t > 0$.

Proof. We can assume that $N$ is a power of 2. If it is not, we can simply introduce some
dummy numbers while increasing the problem size by at most a factor of 2. Suppose we have
a $c > 0$ and a corresponding algorithm. Now consider any 3-XOR instance $x_1, x_2, \ldots, x_N$. We
will use the join algorithm to subquadratically solve this instance. We use a family of linear
hash functions:

Hash Function $h$: For input length $l$ and output length $r$, the function $h$ uses $r$ $l$-bit keys
$a = (a_1, a_2, \ldots, a_r)$ and is defined as $h_a(x) = (\langle a_1, x \rangle, \langle a_2, x \rangle, \ldots, \langle a_r, x \rangle)$ where $\langle a, b \rangle$ denotes
inner product modulo 2.

This hash function is linear, i.e. $h(x) + h(y) = h(x + y)$ where addition is bitwise-xor.
Also, $h_a(0) = 0$ for all $a$, and $\Pr_a[h_a(x) = h_a(y)] \le 2^{-r}$ for any $x \ne y$.
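
A minimal Python sketch of this hash family is given below, with numbers and keys represented as Python integers. The helper name `make_linear_hash` and the way the $r$ output bits are packed into one integer are assumptions of the example.

```python
import random


def make_linear_hash(l, r, rng=random):
    """Return h_a for r random l-bit keys; inputs and outputs are Python ints."""
    keys = [rng.getrandbits(l) for _ in range(r)]

    def h(x):
        out = 0
        for a in keys:
            # The inner product <a, x> modulo 2 is the parity of the bitwise AND.
            bit = bin(a & x).count("1") & 1
            out = (out << 1) | bit
        return out

    return h
```

Because each output bit is a parity of a masked input, `h(x) ^ h(y) == h(x ^ y)` holds for every pair of inputs, and `h(0)` is always 0.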

We pick a small $d > 0$ (the exact value will be specified later), and let $H = N^{1+d}$. Assume
we picked the $d$ such that $N^{1+d}$ is a power of 4 (we can always do this for sufficiently large
$N$). We will hash down our numbers to $[H]$, i.e. to $r = \log(H)$ bits. The linearity of the
hash function means that if $x_i + x_j + x_k = 0$, then $h(x_i) + h(x_j) + h(x_k) = 0$ as well. On
the other hand, if $x_i + x_j + x_k \ne 0$, then the probability that $h(x_i) + h(x_j) + h(x_k) = 0$ is $\frac{1}{H}$.
We will try to solve the 3-XOR problem over the hashed values, and if the original problem
has a solution (3 numbers that sum to 0), then so will the hashed values. On the other hand,
the expected number of false positives (triples of numbers that don't sum to zero, but whose
hashed values sum to 0) is given by the number of triples times the probability of a false
positive, i.e. $\frac{N^3}{H} = N^{2-d}$.

Let $a = \frac{d}{4}$. We have $H$ buckets containing $N$ numbers total. Call a hash bucket heavy if
it has more than $N^{a}$ elements. We would like to bound the number of elements that are
contained in 'heavy' buckets.

We use a lemma from Reference [6]:

► Lemma 36. Let $h$ be a random function $h : U \to [H]$ such that for any $x \ne y$,
$\Pr_h[h(x) = h(y)] \le \frac{1}{H}$. Let $S$ be a set of $N$ elements, and let $B_h(x) = \{y \in S \mid h(x) = h(y)\}$.
For all $k$, we have

$$\Pr_{h,x}\left[\, |B_h(x)| \ge \frac{2N}{H} + k \,\right] \le \frac{1}{k}$$

In particular, the expected number of elements from $S$ with $|B_h(x)| \ge \frac{2N}{H} + k$ is $\le \frac{N}{k}$.


Thus, the expected number of elements in 'heavy' buckets is $N^{1-a}$, which is in $o(N)$. For
each heavy element, we can try summing it with each other $x_i$, and see if the resulting sum
is one of the $x_k$'s. Thus, we can check the sum condition on all heavy elements in time $N^{2-a}$.
Thus, we can now assume that all buckets have $\le N^{a}$ elements.

We now present an instance of the join that is reducible from the 3-XOR problem instance.
For each attribute $B_i$ and $C_j$, their values consist of all bit combinations with $\frac{(1+d)\log(N)}{2}$ bits.
Thus, there are $N^{\frac{1+d}{2}}$ distinct attribute values for each of those attributes. Attributes $A$ and
$D$ have $N^{1+d}$ distinct attribute values each. Each relation $S_i(B_i, C_i)$ has up to $N$ edges as
follows. For each $x_i$ from the original problem that was not in a heavy bucket, we express
its hash value as $h(x_i) = b_i + N^{\frac{1+d}{2}} c_i$. Then, we add an edge between values $b_i \in B_j$ and
$c_i \in C_j$ for $j = 1, 2, 3$. For relations $R_j(A, B_j)$ and $T_j(C_j, D)$, we do the following: Consider
all triples $t_{i,1}, t_{i,2}, t_{i,3}$ of $\frac{(1+d)\log(N)}{2}$-bit numbers whose bitwise-xor is 0. There are $N^{1+d}$
such triples. For each such triple, we take one element $a_i \in A$, and connect it to each of
$t_{i,1} \in B_1$, $t_{i,2} \in B_2$, $t_{i,3} \in B_3$. Similarly, we take one element $d_i \in D$ and connect it to each
of $t_{i,1} \in C_1$, $t_{i,2} \in C_2$, $t_{i,3} \in C_3$. Thus, we have a join instance with relations of size $N^{1+d}$.
Setting up this join instance given the 3-XOR instance takes time $O(N^{1+d})$.

Now we analyze the output of this join instance. Suppose we have an output tuple
$a \in A, b_1 \in B_1, b_2 \in B_2, b_3 \in B_3, c_1 \in C_1, c_2 \in C_2, c_3 \in C_3, d \in D$. From our relations,
we know that there is an $x_{i_1}$ whose hash equals $b_1 + N^{\frac{1+d}{2}} c_1$, an $x_{i_2}$ whose hash equals
$b_2 + N^{\frac{1+d}{2}} c_2$, and an $x_{i_3}$ whose hash equals $b_3 + N^{\frac{1+d}{2}} c_3$. Moreover, since $a$ is connected to
$b_1, b_2, b_3$, we know that the bitwise xor $b_1 + b_2 + b_3 = 0$. Similarly, $c_1 + c_2 + c_3 = 0$. Hence
the bitwise xor $h(x_{i_1}) + h(x_{i_2}) + h(x_{i_3}) = 0$. Thus, either the triple $(x_{i_1}, x_{i_2}, x_{i_3})$ is a
solution to the 3-XOR problem, or it is a false positive.

Now we apply the subquadratic join algorithm whose existence we assumed, on our
join instance of size $N^{1+d}$. If it runs for time greater than $O(N^{(1+d)(2-c)} + N^{2-d})$, we
terminate it and return 'true' for the 3-XOR problem (we will justify this later). Otherwise,
for each output tuple, we get a triple of hash buckets whose bitwise xor is zero. For each
such triple of buckets, we check the (at most $N^{3a}$) corresponding triples of $x_i$'s and check
if they sum to 0. This takes time $O(N^{(1+d)(2-c)} + N^{2-d+3a}) = O(N^{(1+d)(2-c)} + N^{2-a})$. If
we find such a triple, then we return true for the 3-XOR problem. If we don't find such
a triple for any of the outputs of the join, we return false. Now recall that the expected
number of false positives is $N^{2-d}$. If the correct answer to the 3-XOR problem is false, then
the program should terminate in time $O(N^{(1+d)(2-c)} + N^{2-d})$ with high probability. This
justifies our decision to return true if the program runs for a polynomially longer time, as the
probability of the correct answer being false falls exponentially as the program keeps running beyond
$O(N^{(1+d)(2-c)} + N^{2-d})$.

This means we can solve the 3-XOR problem with high probability, in time $O(N^{2-a} +
N^{1+d} + N^{(1+d)(2-c)} + N^{2-d})$. So we choose $d$ small enough such that $(1 + d)(2 - c) < 2$,
and set $t = \min(a, 1 - d, 2 - (1 + d)(2 - c))$. This way, 3-XOR can be solved in time $N^{2-t}$,
proving the theorem. ◄
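
The final parameter choice can be checked with exact arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes $a = d/4$, which is the value that makes the step $N^{2-d+3a} = N^{2-a}$ in the proof hold with equality.

```python
from fractions import Fraction


def xor_exponent_saving(c, d):
    """Return t such that the reduction runs in time N^(2 - t).

    c is the saving of the assumed join algorithm, d the hashing slack.
    """
    c, d = Fraction(c), Fraction(d)
    a = d / 4                    # heavy-bucket exponent: 2 - d + 3a == 2 - a
    join_exp = (1 + d) * (2 - c)  # exponent of running the join on N^(1+d) tuples
    if not (0 < d < 1 and join_exp < 2):
        raise ValueError("d is too large for this c")
    return min(a, 1 - d, 2 - join_exp)
```

For instance, with $c = 1/2$ and $d = 1/10$ the three candidates are $1/40$, $9/10$ and $7/20$, so the sketch returns $t = 1/40$.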

D.4 DARTS Application examples 

► Example 37. Joins over $K_{2,n}$, a special case of 1-series-parallel graphs, have some potential
applications for recommendations. $K_{2,n}$ consists of attributes $X, Z$ on one side, connected to
each of $Y_1, Y_2, \ldots, Y_n$ on the other side. Joining over $K_{2,n}$ where each relation is an instance
of a friendship graph gives us pairs of people who have at least $n$ friends in common, along
with the list of those friends. If instead the $X$ attribute is a Netflix user id, $Z$ is a movie
id, and the $Y_i$'s are attributes such as genres, then the join could be interpreted to mean “find
user-movie pairs such that the user likes at least $n$ attributes of the movie”.

As an example of using DARTS for 1-series-parallel graphs, we prove that a join over
$K_{2,n}$ can be processed in subquadratic time. The join has relations $R_i(X, Y_i), S_i(Y_i, Z)$ for
all $1 \le i \le n$. We prove that the $Q$ of the join is subquadratic using induction on $n$.

Base Case: If n = 1, the graph is a chain, and can be solved in linear time using 
Yannakakis’ algorithm. Since DARTS includes Yannakakis’ algorithm as a special case, it 
can solve the chain in linear time as well i.e. Q = O(N). 

Induction: Now we assume that $Q$ for $K_{2,n}$ is $\le N^{2 - \delta_n}$, for some $\delta_n > 0$. Consider $K_{2,n+1}$.
For any degree configuration $c$ in which at least one of the $Y_i$'s has a degree greater than
$N^{1 - \delta_n/2}$, we perform the heavy transform on that $Y_i$. The number of $Y_i$ values is less than $N^{\delta_n/2}$,
and the reduced graph is a $K_{2,n}$, which has $Q \le N^{2 - \delta_n}$. Thus, the heavy transform gives us
$Q \le N^{2 - \delta_n/2}$ for configuration $c$ of $K_{2,n+1}$. On the other hand, if the degree configuration $c$
has all $Y_i$'s having degree $\le N^{1 - \delta_n/2}$, then we perform light transforms on $\{X, Y_i, Z\}$ for each
$i$ one by one, and end up with relations $R_i'(X, Y_i, Z)$ of size $\le N^{2 - \delta_n/2}$. Now for each $i$, we
perform Split transforms using articulation set $\{X, Z\}$, and $G_1$ consisting of $X, Y_i, Z$. Then
$P(G_1) \le N^{2 - \delta_n/2}$ and the projection onto $XZ$ is of size $\le N^{2 - \delta_n/2}$ as well. This upper bounds
$Q$ by $N^{2 - \delta_n/2}$. Thus, the $Q$ of $K_{2,n+1}$ is subquadratic, which completes the induction.
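
To make the shape of the join in Example 37 concrete, here is a naive Python evaluation for the case where every relation $R_i$ and $S_i$ is the same friendship relation. This is a baseline sketch and not the subquadratic procedure analysed above; the function name and the edge-list encoding are assumptions of the example. The sketch returns the raw join output, so it does not force the $Y_i$ values to be distinct; a caller who wants distinct common friends would need to filter the result.

```python
from itertools import product


def k2n_join(friends, n):
    """Enumerate tuples (x, y_1, ..., y_n, z) of the join with all relations = friends."""
    adj = {}
    for u, v in friends:
        adj.setdefault(u, set()).add(v)
    targets = {v for _, v in friends}

    out = []
    for x in adj:
        for z in targets:
            # Each y_i must satisfy R_i(x, y_i) and S_i(y_i, z).
            common = sorted(y for y in adj[x] if z in adj.get(y, ()))
            out.extend((x,) + ys + (z,) for ys in product(common, repeat=n))
    return out
```

The output for a pair $(x, z)$ with $m$ common neighbours contains $m^n$ tuples, which is why the output size term cannot be avoided in the running time.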

► Example 38. The runtime improvements of DARTS are not limited to treewidth 2 joins. In
general, we can process the join on the complete bipartite graph $K_{m,n}$, which has treewidth
$\min(m, n)$, in time $O(\mathrm{IN}^{\min(m,n) - \epsilon_{m,n}} + \mathrm{OUT})$, where $\epsilon_{m,n} > 0$. Simply marginalizing on an
attribute on the $m$ side gives us a time bound of $O(\mathrm{IN}^{\min(m-1,n) - \epsilon_{m-1,n} + 1} + \mathrm{OUT})$. But we
get $\epsilon_{m,n} > \epsilon_{m-1,n}$, which means DARTS does more than just marginalize on an attribute.

For example, consider $K_{3,3}$ which has treewidth 3. All relations have size $N$, and let the
attributes be $X_1, X_2, X_3$ on one side and $Y_1, Y_2, Y_3$ on the other. Suppose we can process a
join over $K_{2,3}$ in time $O(N^{2 - \epsilon_{2,3}} + \mathrm{OUT})$. Set $\Delta = N^{(2 - \epsilon_{2,3})/3}$. If the degree for an attribute
is $> \Delta$, we could marginalize on it and achieve a runtime of $O(\Delta^{-1} N^{3 - \epsilon_{2,3}} + \mathrm{OUT})$. On the
other hand, if all degrees are $\le \Delta$, then we can perform a Light transform on $\{X_1, X_2, X_3, Y_i\}$
for each $i$ to get 3 relations of size $\le N\Delta^{2}$ each. We can join them using Split transforms,
getting a runtime of $O(N\Delta^{2} + \mathrm{OUT})$. Either way, the runtime of DARTS is bounded by
$O(N^{3 - \epsilon_{3,3}} + \mathrm{OUT})$ where $\epsilon_{3,3} = 3 - (1 + 2((2 - \epsilon_{2,3})/3)) = 2(1 + \epsilon_{2,3})/3 < 1 + \epsilon_{2,3}$.
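
The exponent arithmetic of the $K_{3,3}$ example can be verified mechanically. The following sketch uses exact fractions; the function name `k33_saving` is hypothetical, and the code only reproduces the balancing computation from the text.

```python
from fractions import Fraction


def k33_saving(eps23):
    """Return eps_{3,3}, given the saving eps_{2,3} for the K_{2,3} join."""
    eps23 = Fraction(eps23)
    delta_exp = (2 - eps23) / 3        # Delta = N^delta_exp
    heavy_exp = 3 - eps23 - delta_exp  # exponent of Delta^{-1} * N^{3 - eps_{2,3}}
    light_exp = 1 + 2 * delta_exp      # exponent of N * Delta^2
    assert heavy_exp == light_exp      # the threshold balances the two cases
    return 3 - light_exp
```

For $\epsilon_{2,3} = 1/5$ this gives $\epsilon_{3,3} = 4/5$, which agrees with the closed form $2(1 + \epsilon_{2,3})/3$.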


E Proofs on m-width and MO bound (Sections 3.3, 4.4)

E.1 Proof of Theorem 7

We first state and prove a more general proposition, and Proposition 7 will be a corollary.


► Proposition 39. For all $A \subseteq \mathcal{A}$, we can compute a relation $R_A$ in time $O(\mathrm{IN}^{m_A})$ such that
(i) $|R_A| \le \mathrm{IN}^{m_A}$, (ii) $\pi_A(\bowtie_{R \in \mathcal{R}} R) \subseteq R_A$ (where $m_A$ is as defined in Section 4.4).


Proof. For each $A \subseteq \mathcal{A}$, let $O_A = \pi_A(\bowtie_{R \in \mathcal{R}} R)$. Fix any $A \subseteq \mathcal{A}$ and consider the solution to
Prog($A$). In the solution, there must be at least one tight constraint of the form $s_A \le s_B$ for
$A \subseteq B$, or $s_A \le s_B + d(P, Q, R)$ for some $P, Q, E, R$ such that $P \subseteq Q \subseteq \mathrm{attr}(R)$, $B = P \cup E$,
$A = Q \cup E$. Then in turn, there must be a similar constraint on $s_B$. The only constraint in
the system that does not have one relation on the LHS and one on the RHS is the $s_\emptyset = 0$
constraint.

Thus there must be a chain $A_0, A_1, \ldots, A_k$ such that $A_0 = \emptyset$, $A_k = A$ and there is a tight
constraint with $A_{i+1}$ on the LHS and $A_i$ on the RHS (i.e. $s_{A_{i+1}} \le s_{A_i} + \ldots$). Then we produce
a sequence of relations $R_0, \ldots, R_k$ such that for all $i$: $|R_i| \le \mathrm{IN}^{m_{A_i}}$. The final $R_k$ equals
our $R_A$. We produce these relations inductively: If $A_{i+1} \subseteq A_i$, then we set $R_{i+1} = \pi_{A_{i+1}} R_i$.
Otherwise, there exist $P, Q, R, E$ such that $P \subseteq Q \subseteq \mathrm{attr}(R)$, $A_i = P \cup E$, $A_{i+1} = Q \cup E$
and $s_{A_{i+1}} = s_{A_i} + d(P, Q, R)$. Then we set $R_{i+1} = R_i \bowtie \pi_Q(R)$. Since these operations
only involve relations in the original join, all $R_i$'s satisfy $O_{A_i} \subseteq R_i$. Moreover, for all $i$,
$|R_i| \le \mathrm{IN}^{s_{A_i}}$. Thus, $R_k$ is computed in time $O(\mathrm{IN}^{s_{A_k}}) = O(\mathrm{IN}^{m_A})$, and satisfies $O_A \subseteq R_k$.
Setting $R_A = R_k$ gives us the required $R_A$ that satisfies conditions (i) and (ii) of the proposition,
completing our proof. ◄


► Proposition. The output size $|\bowtie_{R \in \mathcal{R}} R|$ is in $O(\mathrm{IN}^{m_{\mathcal{A}}})$.


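Proof. For each $A \subseteq \mathcal{A}$, let $O_A = \pi_A(\bowtie_{R \in \mathcal{R}} R)$. We set $A = \mathcal{A}$ in Proposition 39. $O_{\mathcal{A}}$ is
simply the output of the join $\bowtie_{R \in \mathcal{R}} R$, and since it is a subset of $R_{\mathcal{A}}$ which has size $\le \mathrm{IN}^{m_{\mathcal{A}}}$,
the output itself must have size $O(\mathrm{IN}^{m_{\mathcal{A}}})$. ◄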


E.2 Proof of Theorem 22

► Theorem. Any join query can be answered in time $O(\mathrm{IN}^{\mathrm{MW}} + \mathrm{OUT})$, where MW is its
m-width.


Proof. For all $A \subseteq \mathcal{A}$, let $O_A$ denote $\pi_A(\bowtie_{R \in \mathcal{R}} R)$. Given a GHD $(T, \chi)$ with m-width
equal to MW, we perform the join in three steps:

• For each bag $\chi(t)$ of the GHD, we compute $R_{\chi(t)}$ like in Proposition 39. That is, we
compute $R_{\chi(t)}$ in time $O(\mathrm{IN}^{m_{\chi(t)}})$ such that (i) $|R_{\chi(t)}| \le \mathrm{IN}^{m_{\chi(t)}}$, (ii) $O_{\chi(t)} \subseteq R_{\chi(t)}$. The
latter property ensures that $O_{\mathcal{A}} \subseteq \bowtie_{t \in T} R_{\chi(t)}$. Moreover, by definition of m-width, the
computation time for each $R_{\chi(t)}$ and the size of $R_{\chi(t)}$ are bounded by $O(\mathrm{IN}^{\mathrm{MW}})$.

• Then for each bag $\chi(t)$, we compute $R'_{\chi(t)}$, which is $R_{\chi(t)}$ semi-joined with $\pi_{\chi(t)}(R)$ for
each $R \in \mathcal{R}$. This ensures that $O_{\mathcal{A}} = \bowtie_{t \in T} R'_{\chi(t)}$. Moreover, $|R'_{\chi(t)}| \le |R_{\chi(t)}| \le \mathrm{IN}^{\mathrm{MW}}$.

• Then we use Yannakakis' algorithm to join all the $R'_{\chi(t)}$'s. This can be done in time
$O(\mathrm{IN}^{\mathrm{MW}} + \mathrm{OUT})$, completing the proof.

◄
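
The semi-join in the second step of this proof can be sketched in a few lines of Python. The function name `semijoin_bag`, the encoding of relations as lists of tuples, and the explicit attribute-name tuples are assumptions of the example. The bag relation is filtered against the projection of one base relation onto the attributes the two share.

```python
def semijoin_bag(bag_rel, bag_attrs, rel, rel_attrs):
    """Keep the bag tuples that agree with some tuple of rel on the shared attributes."""
    shared = [a for a in bag_attrs if a in rel_attrs]
    bag_idx = [bag_attrs.index(a) for a in shared]
    rel_idx = [rel_attrs.index(a) for a in shared]
    # Projection of rel onto the attributes it shares with the bag.
    proj = {tuple(t[i] for i in rel_idx) for t in rel}
    return [t for t in bag_rel if tuple(t[i] for i in bag_idx) in proj]
```

Applying this filter once per base relation never enlarges the bag relation, which matches the size claim made for the semi-joined bags in the proof.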


E.3 Proof of Theorem 23

► Theorem. For any join query $\mathcal{R}$, and any degree configuration $c \in C_2$, $\mathrm{MO}(\mathcal{R}(c)) \le
\mathrm{DBP}(\mathcal{R}(c), 2) + |C| \log(2)$, where $C$ is the cover used in the DBP bound.

Proof. $\mathrm{DBP}(\mathcal{R}(c), 2)$ is obtained by solving Linear Program 3 for the optimal cover. Let $C$
be the optimal cover, and $v_a$ be the value in the optimal solution for each $a \in \mathcal{A}$. And for
each $A \subseteq \mathcal{A}$, let $s_A$ denote the value in the optimal solution for the linear program Prog($\mathcal{A}$).

Let $C = \{(R_1, A_1), (R_2, A_2), \ldots, (R_{|C|}, A_{|C|})\}$, where $R_i \in \mathcal{R}$ and $A_i \subseteq \mathrm{attr}(R_i)$ for all $i$.
Define $B_j = \bigcup_{i=1}^{j} A_i$ for all $1 \le j \le |C|$. Since $C$ is a cover, we must have $B_{|C|} = \mathcal{A}$.

Now for each $j$, we will show that $s_{B_j} \le j \log(2) + \sum_{a \in B_j} v_a$. We do this using induction
on $j$. Then for $j = |C|$ the LHS $s_{B_{|C|}}$ equals $\mathrm{MO}(\mathcal{R}(c))$ and the RHS $|C| \log(2) + \sum_{a \in \mathcal{A}} v_a$ equals
$\mathrm{DBP}(\mathcal{R}(c), 2) + |C| \log(2)$, proving our theorem.

Base Case: For $j = 1$, setting $R = R_1$, $A = A_1$, $A' = A_1$ for Linear Program 3 gives
us the constraint $\sum_{a \in A_1} v_a \ge \log(d_{\pi_{A_1}(R_1), \emptyset} / 2)$. And Prog($\mathcal{A}$) with $A = \emptyset$, $B = A_1$, $R = R_1$,
$E = \emptyset$ gives us the constraint $s_{A_1} \le s_\emptyset + d(\emptyset, A_1, R_1) = \log(d_{\pi_{A_1}(R_1), \emptyset}) \le \log(2) + \sum_{a \in A_1} v_a$.
Then since $B_1 = A_1$, our base case is proved.

Induction: Suppose we have proved the bound for $j - 1$. Now let
$E_j = B_j \setminus B_{j-1}$. Then Linear Program 3 with $R = R_j$, $A = A_j$, $A' = E_j$ gives us
$\sum_{a \in E_j} v_a \ge \log(d_{\pi_{A_j}(R_j), A_j \setminus E_j} / 2)$. Prog($\mathcal{A}$) with $R = R_j$, $A = A_j \setminus E_j$, $B = A_j$, $E = B_{j-1}$
gives us $s_{B_{j-1} \cup A_j} \le s_{(A_j \setminus E_j) \cup B_{j-1}} + \log(d_{\pi_{A_j}(R_j), A_j \setminus E_j})$. Now $B_j = B_{j-1} \cup A_j$ by definition of $B_j$,
and $(A_j \setminus E_j) \cup B_{j-1} = B_{j-1}$ since $A_j \subseteq B_j = E_j \cup B_{j-1}$. So $s_{B_j} \le s_{B_{j-1}} + \log(d_{\pi_{A_j}(R_j), A_j \setminus E_j})
\le s_{B_{j-1}} + \log(2) + \sum_{a \in E_j} v_a$. And by the inductive hypothesis, $s_{B_{j-1}} \le (j - 1) \log(2) + \sum_{a \in B_{j-1}} v_a$.
This gives us $s_{B_j} \le j \log(2) + \sum_{a \in B_j} v_a$.

This proves that $s_{B_j} \le j \log(2) + \sum_{a \in B_j} v_a$ for all $j$, and consequently that $\mathrm{MO}(\mathcal{R}(c)) \le
\mathrm{DBP}(\mathcal{R}(c), 2) + |C| \log(2)$, completing our proof. ◄


E.4 Recovering DARTS results using GHDs

Theorem 23 shows that the MO bound is smaller than the AGM bound. As a result, the
MW of a GHD is smaller than its fhw. This lets us recover Propositions 15 to 17. We now show
how to recover the subquadratic join results from Theorem 20 and the AYZ result.


E.4.1 Recovering AYZ 

► Proposition. A cycle join of length $n$ with all relations having size $N$ has m-width
$\le 2 - \frac{1}{1 + \lceil n/2 \rceil}$, recovering the result of the AYZ algorithm.

The cycle join has relations $R_1(X_1, X_2), \ldots, R_n(X_n, X_1)$ of size $N$ each. Choose $\Delta =
N^{\frac{1}{1 + \lceil n/2 \rceil}}$ as before. We will show that for each degree configuration, we can construct a
GHD that has $\mathrm{MW} \le 2 - \frac{1}{1 + \lceil n/2 \rceil}$.

Suppose the configuration is such that the degree of some $X_k$ is $> \Delta$. Then we build a
GHD with bags $\{X_k\} \cup \mathrm{attr}(R_j)$ for each $j$. The bags form a chain $\{X_k\} \cup R_k$, $\{X_k\} \cup R_{k+1}$,
$\{X_k\} \cup R_{k+2}$, $\ldots$, $\{X_k\} \cup R_{k-2}$, $\{X_k\} \cup R_{k-1}$, which gives us the GHD. The m value for
each bag is bounded by $\log(N^2 \Delta^{-1})$ since $m_{\mathrm{attr}(R_j)} \le \log(N)$ and using $A = \emptyset$, $B = \{X_k\}$,
and $d(A, B, R_k) \le \log(N \Delta^{-1})$.

If all degrees in the configuration are $\le \Delta$, we form a GHD with two bags: $\{X_1, X_2, \ldots,
X_{\lceil n/2 \rceil}\}$ and $\{X_1, X_n, X_{n-1}, \ldots, X_{\lceil n/2 \rceil}\}$. The m value of each bag is still $N\Delta^{\lceil n/2 \rceil} = N^2 \Delta^{-1}$.
This time, we have $m_{\{X_1, X_2\}} \le \log(N)$ and for each $i$, $m_{\{X_1, \ldots, X_{i+1}\}} \le m_{\{X_1, \ldots, X_i\}} + \log(\Delta)$
since $d(A, B, R) = \log(\Delta)$ for $A = \{X_i\}$, $B = \{X_i, X_{i+1}\}$, $R = R_i$.

Thus for each degree configuration, we can find a GHD with $\mathrm{MW} \le N^2 \Delta^{-1}$, which
implies that the m-width is $\le 2 - \frac{1}{1 + \lceil n/2 \rceil}$, which lets us recover the AYZ result.

E.4.2 Lemma 33

► Lemma. If we have a 1-series-parallel graph, which has a direct edge from $X_S$ to $X_T$ (i.e.
a path of length 1), then the m-width of a join over the graph is $< 2$.

Proof. Once again, we will show that for any degree configuration, we can construct a GHD
with $\mathrm{MW} < 2$. Suppose there are $k$ paths from $X_S$ to $X_T$ excluding the $X_S X_T$ edge. Each
of the $k$ paths, along with edge $X_S X_T$, forms a cycle. For each cycle, we form a GHD for the
given degree configuration like we did for the AYZ recovery. Call these GHDs $D_1, D_2, \ldots, D_k$.
Since we have an edge $X_S X_T$, each $D_i$ contains at least one bag $B_i$ that contains both $X_S$
and $X_T$. We create a new bag $\{X_S, X_T\}$, and connect it to each $B_i$ for $1 \le i \le k$. This
gives us a GHD for the full join, and the m value of its bags is no more than it was in the
original GHDs, which was shown to be $< 2$ when we recovered AYZ. As a result, when there
is an $X_S X_T$ edge, we have a GHD with $\mathrm{MW} < 2$ for every degree configuration, and thus the
m-width of the join is $< 2$. ◄

E.4.3 Lemma 34

► Lemma. Suppose we have a 1-series-parallel graph $G$, which does not have a direct edge
from $X_S$ to $X_T$, but has a vertex $X_U$ such that there is an edge from $X_S$ to $X_U$ and from
$X_U$ to $X_T$ (i.e. a path of length 2 from $X_S$ to $X_T$). Let $G'$ be the graph obtained by deleting
the vertex $X_U$ and edges $X_S X_U$ and $X_U X_T$. Then the m-width of a join on $G$ is $< 2$ if and
only if the m-width of the join on $G'$ is $< 2$.

Proof. We have edges $X_S X_U$ and $X_U X_T$ and no direct edge $X_S X_T$. As before, one direction
is easy to prove. Suppose the m-width of the join over $G$ is $< 2$. That is, the join on $G$ has
a GHD with $\mathrm{MW} < 2$ for all degree configurations. Then for any configuration $c'$ for $G'$,
consider the corresponding configuration $c$ for $G$ where $X_U$ has degree $N$ in both its relations
and other degrees are the same. Consider the GHD with $\mathrm{MW} < 2$ for this configuration
on $G$. We have $s_{\{X_U\}} = 0$ and $s_A = s_{A \cup \{X_U\}}$ for all $A \subseteq \mathcal{A}$. Then the GHD obtained by
removing $X_U$ from each bag gives us a GHD for $G'$ with $\mathrm{MW} < 2$. This implies that the
m-width of the join over $G'$ is also $< 2$.

Now suppose the m-width of the join over $G'$ is $< 2$. That is, there is an $\epsilon$ such that for
each degree configuration for $G'$, there is a GHD with $\mathrm{MW} \le 2 - \epsilon$. Now consider any degree
configuration $c$ for $G$ and the configuration $c'$ for $G'$ obtained by keeping the same degrees
for all values (not in $X_U$). Suppose $X_U$ has degree $> N^{1 - \epsilon/2}$; then $s_{\{X_U\}} \le \epsilon/2$. Let $D'$ be a
GHD of $G'$ with $\mathrm{MW} \le 2 - \epsilon$. Adding $X_U$ to each bag of GHD $D'$ gives us a GHD for $G$
that has $\mathrm{MW} \le 2 - \epsilon/2$.

So now we can assume that the degree of $X_U$ is $\le N^{1 - \epsilon/2}$ in both its relations. Thus
$s_{\{X_S, X_U, X_T\}} \le 2 - \epsilon/2$. Now like in the previous proof, we will consider every other path from
$X_S$ to $X_T$, and construct a GHD with $\mathrm{MW} < 2$ for each path, which has at least one bag
containing both $X_S$ and $X_T$. Then we can create a new bag $\{X_S, X_T\}$ and use it to stitch
all the GHDs together to get a GHD for $G$ that has $\mathrm{MW} < 2$. We now describe how to
construct the $\mathrm{MW} < 2$ GHD for each path.

Consider any other path $X_1, X_2, \ldots, X_n$ where $X_1 = X_S$, $X_n = X_T$. Let our relations
in the path be $R_1(X_1, X_2), \ldots, R_{n-1}(X_{n-1}, X_n)$. Let $\delta = N^{\epsilon/(2n+4)}$. Suppose some $X_i$ has
degree $> \delta$ in relation $R_i$. Choose the smallest such $i$ (so for all $j < i$, the degree of $X_j$ in $R_j$
is $\le \delta$). Then we form a GHD with one bag $\{X_n, X_1, X_2, \ldots, X_i\}$, and also a bag $\{X_i\} \cup R_j$
for each $j \ge i$. The m of the first bag is $\log(N^{2 - \epsilon/2} \delta^{i})$ (because $m_{\{X_n, X_1\}} \le \log(N^{2 - \epsilon/2})$
and each of $X_2, \ldots, X_i$ adds $\log(\delta)$ to it). From the definition of $\delta$, we have $N^{2 - \epsilon/2} \delta^{i} <
N^2 \delta^{-1}$. The m of other bags is $\log(N^2 \delta^{-1})$, since $m_{\{X_j, X_{j+1}\}} \le \log(N)$ and $X_i$ adds at most
$\log(N \delta^{-1})$. Thus the MW of the path GHD is $\le 2 - \log(\delta)$. On the other hand, if no $X_i$ has
degree $> \delta$ in any $R_i$, then a single bag $\{X_1, X_2, \ldots, X_n\}$ has $m \le \log(N^{2 - \epsilon/2} \delta^{n-2})$, which
gives us a GHD for the path with $\mathrm{MW} < 2$.

Thus for each degree configuration of $G$, we can construct a GHD with MW $< 2$, which
implies that the m-width of the join over $G$ is $< 2$. ◄
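
The exponent bookkeeping in the path construction can be checked mechanically. The sketch below works in units of $\log N$ and assumes the reading $\delta = N^{\epsilon/(2n+4)}$ together with a bound of $2 - \epsilon/2$ on the bag $\{X_n, X_1\}$; both are reconstructed from the text and should be treated as assumptions. The function name is illustrative only.

```python
from fractions import Fraction

def path_ghd_exponents_ok(n, eps):
    """Check, in units of log N, that every bag of the path GHD stays below 2.

    Assumes delta = N^(eps / (2n + 4)) and that the bag {X_n, X_1} has
    exponent below 2 - eps/2 (assumptions taken from the proof text).
    """
    delta = Fraction(eps) / (2 * n + 4)   # exponent of delta
    base = 2 - Fraction(eps) / 2          # exponent bound for {X_n, X_1}
    # Case 1: some X_i (smallest such i) has degree above delta.
    for i in range(1, n + 1):
        first_bag = base + i * delta      # bag {X_n, X_1, ..., X_i}
        other_bags = 2 - delta            # bags {X_i} together with R_j, j > i
        if not (first_bag < other_bags < 2):
            return False
    # Case 2: every degree is below delta; one bag {X_1, ..., X_n}.
    return base + (n - 2) * delta < 2
```

Exact rational arithmetic is used so that the strict inequalities are not blurred by floating-point rounding.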

E.5 Comparison to other widths 

Theorem 23 implies that m-width is no larger than fractional hypertreewidth (and consequently,
no larger than treewidth and generalized hypertreewidth). m-width can even be
smaller than submodular width (which, ignoring m-width, is the tightest known notion of
width for general joins), as shown in the example below.

► Example 40. Consider a cycle join with $n$ relations, with each relation having size $N$ and
all degrees being equal to 1 in each relation. Then the m-width of the join is given by 1
(because all the $d(A, B, R)$ values in Linear Program 2 are 0 for $A \neq \emptyset$ and 1 if $A = \emptyset$). On
the other hand, the submodular width of this join is 2 — . 

Similarly, if we consider a clique join with $n$ attributes (i.e. for each pair of attributes,
there is a single relation with $N$ tuples), and all degrees are 1 in each relation, then the
m-width of the join is 1, while the submodular width is $n/2$, which can be unboundedly
larger.

The above examples rely on the fact that m-width takes actual degrees of the relations
following degree-uniformization into account, while submodular width uses worst-case degrees.
In addition, whenever $m_A$ happens to be a submodular function over $A$, m-width is guaranteed
to be $\leq$ submodular width. Unfortunately, $m_A$ is not always submodular, as shown by the
example below:

► Example 41. Consider a join with relations $R(A, B)$, $S(A, C)$, $T(B)$, $U(C)$. Let $|R| = |S| = N$,
$|T| = |U| = \sqrt{N}$. And let the degree of each $A$ value in each relation be $\sqrt{N}$ (so
there are $\sqrt{N}$ distinct $A$ values), while the degrees of $B$ and $C$ values are 1 (so there are
$N$ distinct $B$, $C$ values in $R$, $S$ and $\sqrt{N}$ values in $T$, $U$). Now we compute the $m$ values for
different sets.

Since there are $N$ $B$, $C$ values in relations $R$, $S$, but only $\sqrt{N}$ $B$, $C$ values in relations $T$
and $U$, we have $m_{\{B\}} = m_{\{C\}} = \log(\sqrt{N})$, and $m_{\{A\}}$ is $\log(\sqrt{N})$ as well. Now for $m_{\{A,B\}}$,
we have $s_{\{A,B\}} \leq s_{\{B\}} + d(\{B\}, \{A, B\}, R)$. Since the degree of $B$ is 1, $d(\{B\}, \{A, B\}, R)$
is 0, which gives us $m_{\{A,B\}} = \log(\sqrt{N})$ as well. Similarly, $m_{\{A,C\}} = \log(\sqrt{N})$. Finally, we
have $m_{\{B,C\}} = m_{\{A,B,C\}} = \log(N)$. Thus we have $m_{\{A\}} + m_{\{A,B,C\}} = \log(N\sqrt{N})$, while
$m_{\{A,B\}} + m_{\{A,C\}} = \log(N)$, which implies that $m$ is not submodular.
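
The failure of submodularity can be confirmed by brute force over all pairs of attribute sets. The sketch below records the $m$ values of Example 41 in units of $\log N$; the dictionary name, the helper function, and the convention $m_\emptyset = 0$ are assumptions added for illustration.

```python
from fractions import Fraction
from itertools import combinations

half, one = Fraction(1, 2), Fraction(1)

# m values from Example 41 in units of log N (m of the empty set assumed 0).
m = {
    frozenset(): Fraction(0),
    frozenset("A"): half,
    frozenset("B"): half,
    frozenset("C"): half,
    frozenset("AB"): half,
    frozenset("AC"): half,
    frozenset("BC"): one,
    frozenset("ABC"): one,
}

def submodularity_violations(f):
    """Return pairs (X, Y) with f(X) + f(Y) < f(X | Y) + f(X & Y)."""
    return [
        (X, Y)
        for X, Y in combinations(f, 2)
        if f[X] + f[Y] < f[X | Y] + f[X & Y]
    ]

violations = submodularity_violations(m)
```

With these values the only violating pair is $\{A,B\}$, $\{A,C\}$, matching the pair used in the example.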

The above example gets to the heart of why our degree uniformization is weaker than
Marx’s uniformization (while being less expensive). Our degrees are uniform within relations,
but not necessarily in the final output. For example, each $A$ value has degree $\sqrt{N}$ in the
relations, but because only $\sqrt{N}$ out of $N$ $B$ and $C$ values will be in the output, the degree of
an $A$ value in the output can range anywhere from 1 to $\sqrt{N}$. Marx’s uniformization ensures
that degrees are uniform in certain projections of the output as well.

Even though we started with $\sqrt{N}$ values of $A$ each having degree $\sqrt{N}$, once most of the
$B$ and $C$ values are eliminated due to relations $T$, $U$, both the number of matching $A$ values
and their degrees are reduced. The number of $A$ values that still have degree $\sqrt{N}$ can now
be at most 1 (since there are $\sqrt{N}$ values of $B$, $C$ left). This change in the number of values
is not taken into account in our $s$ values. One naive way to remedy this is to repeatedly
perform degree-uniformization after every step of the join, but this can lead to a higher than
linear cost.


E.6 Relating subquadratic solvability to widths 


Each graph that we showed to be subquadratically solvable has m-width $< 2$ (and also
submodular width $< 2$). Moreover, the 3-SUM hard 1-series-parallel graph from Theorem 20
can be shown to have m-width and submodular width equal to 2. We show this next.

The graph has edges $X_S X_{A_1}$, $X_{A_1} X_{B_1}$, $X_{B_1} X_T$, $X_S X_{A_2}$, $X_{A_2} X_{B_2}$, $X_{B_2} X_T$, $X_S X_{A_3}$,
$X_{A_3} X_{B_3}$, $X_{B_3} X_T$. Then we give an edge-dominated submodular function $f$ such that for any
GHD, there must exist a bag $\chi(t)$ such that $f(\chi(t)) \geq 2$. Suppose there are $N$ values in
$X_S$, $X_T$ with degree 1 in each relation, and $\sqrt{N}$ values in other attributes with degree $\sqrt{N}$
in each relation. Then the $m$ values for this join happen to be submodular. Specifically,
we have $m_{\{X_S\}} = m_{\{X_T\}} = 1$, and for all $i$, we have $m_{\{X_S, A_i\}} = m_{\{X_T, B_i\}} = 1$,
$m_{\{A_i\}} = m_{\{B_i\}} = 0.5$, $m_{\{A_i, B_i\}} = 1$, $m_{\{X_S, B_i\}} = m_{\{X_T, A_i\}} = m_{\{X_S, A_i, B_i\}} = m_{\{X_T, A_i, B_i\}} = 1.5$,
$m_{\{X_S, A_i, B_i, X_T\}} = m_{\{X_S, B_i, X_T\}} = m_{\{X_S, A_i, X_T\}} = m_{\{X_S, X_T\}} = 2$. Moreover, for all $i$, $j \neq i$,
if $P_i = \{X_S, A_i, B_i, X_T\}$, $P_j = \{X_S, A_j, B_j, X_T\}$, and $P \subseteq P_i \cup P_j$ then
$m_P = m_{P \cap P_i} + m_{P \cap P_j} - m_{P \cap P_i \cap P_j}$. $m_P$ for $P \subseteq P_1 \cup P_2 \cup P_3$ can be found similarly.

Now any GHD that puts $X_S$ and $X_T$ together must have width 2 since $m_{\{X_S, X_T\}} = 2$.
But if $X_S$ and $X_T$ never occur together, then the path between their nodes in the GHD must
contain each of the paths in the graph ($\{X_S A_i, A_i B_i, B_i X_T\}$ for all $i$). Thus each node in
the path must contain at least one node from each path, and at least one of them must contain
the edge $A_i B_i$. This means that at least one node in the GHD must contain four of the $A_i$s
and $B_i$s combined, which again makes the width 2. This shows that the submodular width
of the 3-SUM hard graph is 2.

This may suggest that a join can be solved subquadratically if and only if its submodular
width is $< 2$. However, this is not the case. In fact, submodular width is not a tight
lower bound on the runtime exponent. As a counterexample, a triangle join has submodular
width equal to 3/2. But when output size is small, a triangle join can be computed in time
$\mathrm{IN}^{4/3}$ [9]. This triangle computation algorithm uses matrix multiplication as a subroutine,
and makes use of the fact that the matrix multiplication exponent $\omega$ is $< 3$ (the matrix
multiplication exponent $\omega$ is defined as the smallest value such that two dense $N \times N$ matrices
can be multiplied in time $O(N^\omega)$). As another example, the graph with edges $XY_1$, $XY_2$,
$Y_1 Z_1$, $Y_2 Z_1$, $XY_3$, $XY_4$, $Y_3 Z_2$, $Y_4 Z_2$, $Z_1 Z_2$ can also be shown to have submodular width 2.
But we can compute its join in subquadratic time, again by using matrix multiplication in
combination with the DARTS algorithm.


► Theorem. Consider a graph with edges $XY_1$, $XY_2$, $Y_1 Z_1$, $Y_2 Z_1$, $XY_3$, $XY_4$, $Y_3 Z_2$, $Y_4 Z_2$,
$Z_1 Z_2$. A join over the graph can be solved in subquadratic time when output size is small.

Proof. (Sketch) We briefly describe the transforms used to reduce the above join. First, if
$X$ has degree $N^\epsilon$ for any $\epsilon > 0$, then a heavy transform reduces the join to an acyclic one,
which means we can process the join in time $O(N^{2-\epsilon} + \mathrm{OUT})$. So assume that $X$ has small
degree.

Then we perform a light transform on $\{X, Y_1, Y_2, Y_3, Y_4\}$, which gives a single relation
of size $\approx N$ (since $X$ has low degree). Then we use a split transform to remove $X$, and we
are left with edges $Y_1 Y_2 Y_3 Y_4$, $Y_1 Z_1$, $Y_2 Z_1$, $Y_3 Z_2$, $Y_4 Z_2$, $Z_1 Z_2$, all of size $N$.

Now, if either $Z_1$ or $Z_2$ has degree $> N^{0.5+\epsilon}$, we do a heavy transform on it, reducing
the problem to a triangle join which can be solved in time $N^{3/2}$. In fact, if the degree of $Z_1$
is more than $d \times N^\epsilon$, while that of $Z_2$ is $d$ for any $d$, then we can do a heavy transform on
$Z_1$, and the number of triangles for $Z_2$ is bounded by $Nd$, which gives us subquadratic time.
So now we can assume that the degrees of $Z_1$ and $Z_2$ are almost equal, and less than $\sqrt{N}$.

But if the degrees of $Z_1$ and $Z_2$ are less than $N^{0.25-\epsilon}$ each, then a light transform on
all attributes gives us an output with size $< N^{2-4\epsilon}$ (as each $Z_1$, $Z_2$ has at most $N^{1-4\epsilon}$
quadruples of neighbors). So assume the degrees of $Z_1$, $Z_2$ are almost equal and between
$N^{0.25}$ and $N^{0.5}$.

If the degrees of $Z_1$, $Z_2$ are given by $d < N^{0.5-\epsilon}$, then we perform light transforms on
$\{Z_1, Y_1, Y_2\}$ and $\{Z_2, Y_3, Y_4\}$, to get two triangles that have $< Nd$ tuples each. Then we
perform a split transform using articulation set $\{Z_1, Y_3, Y_4\}$. We can compute the join on
attributes $Z_1$ and all the $Y$’s in time $N^2/d$ as there are $N/d$ $Z_1$ values and $N$ values of
the $Y$’s. Thus the size bound on the projection onto $\{Z_1, Y_3, Y_4\}$ is also $N^2/d$. Then we
can compute the join for $Z_1$, $Z_2$, $Y_3$, $Y_4$ in time $Nd^2$ since there are $Nd$ values of $Z_2 Y_3 Y_4$,
and each $Z_2$ value has at most $d$ neighbors in $Z_1$. Thus we can solve this join in time
$Nd^2 < N^{2-2\epsilon}$.

Now finally, assume that values in $Z_1$, $Z_2$ both have degree $d = N^{0.5}$. Like in the previous
case, we perform a split transform on $Z_1$, $Y_3$, $Y_4$ and compute the join of $Z_1$ with all $Y$’s
and their projection onto $Z_1 Y_3 Y_4$ in time $N^2/d = N^{3/2}$. But the other remaining join has
relations $Z_1 Y_3 Y_4$, $Z_2 Y_3 Y_4$ and $Z_1 Z_2$ of sizes $N^{3/2}$, $N^{3/2}$, $N$ respectively. We have $N^{1/2}$
values in $Z_1$, $Z_2$ and $N$ values in $Y_3 Y_4$. We can convert $Y_3$, $Y_4$ into a single attribute with
$N$ values to get a triangle join. Then we can randomly divide the $N$ values of $Y_3 Y_4$ into
$\sqrt{N}$ sets, to get $\sqrt{N}$ triangle joins (of three relations of size $N$ each). This is where we use
matrix multiplication. Using the matrix multiplication based algorithm for triangle
finding [9], we can solve each triangle join in time strictly less than $N^{3/2}$ when OUT is small.
Then we can combine the solutions from the $\sqrt{N}$ triangle joins, and the total time taken is
strictly less than $N^{3/2} \times \sqrt{N} = N^2$. This proves that the join can be solved in subquadratic
time. ◄
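
As a quick sanity check on the middle case of the sketch, the running time $N^2/d + Nd^2$ with $d = N^x$ has exponent $\max(2 - x,\, 1 + 2x)$. The snippet below (function and variable names are illustrative, not from the paper) evaluates this exponent on a grid of degrees between $N^{0.25}$ and just below $N^{0.5}$.

```python
def middle_case_exponent(x):
    """Exponent of the running time N^2/d + N*d^2 when d = N^x."""
    return max(2 - x, 1 + 2 * x)

# Degree exponents from 0.25 up to 0.49, i.e. d between N^0.25 and N^(0.5 - eps).
samples = [0.25 + 0.01 * k for k in range(25)]
exponents = [middle_case_exponent(x) for x in samples]
```

Every sampled exponent stays strictly below 2, and the bound only approaches 2 as $x$ approaches 0.5, which is why that boundary case needs the separate matrix-multiplication argument.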

F DBP Bound and Parallel Processing 
F.1 Intuition behind the DBP bound

The intuition behind the DBP bound is clearer when we use the dual version of Linear
Program 3.

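► Linear Program 4. (Dual of Linear Program 3)

$$\text{Maximize} \sum_{(R,A) \in C,\, A' \subseteq A} w_{R,A'} \log\left(d_{\pi_A(R),\, A \setminus A'}\right) \quad \text{s.t.} \quad \forall a \in \mathcal{A}: \sum_{(R,A) \in C,\, A' \subseteq A \,\mid\, a \in A'} w_{R,A'} \leq 1$$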

Linear Program 4 is structurally similar to an edge packing program. In edge packing we
assign a non-negative weight to each edge such that the total weight on each attribute is $\leq 1$,
while maximizing the sum of all weights (weighted by log of the relation sizes). The linear
program for $DBP(\mathcal{R}, L)$ can be thought of as a variant of edge packing with the following
differences:

• Instead of assigning weights to only relations, we assign weights ($w_{R,A'}$) to subrelations
$\pi_{A'}(R)$ as well.

• We take a minimum over all covers of the join, where covers can consist of relations ($R$)
or subrelations ($\pi_A(R)$).

• The biggest difference is, in edge packing the weight of each edge $\pi_{A'}(R)$ is multiplied by
the log of its size. Here, instead of size, we use the maximum number of distinct values in
$\pi_{A'}(R)$ that an external value (in $\pi_{A \setminus A'}(R)$) can connect to. This in-degree $d_{\pi_A(R),\, A \setminus A'}$
is naturally bounded by the size $|\pi_{A'}(R)|$ but can be smaller for sparse relations.


F.2 Proof of Theorem 26

► Theorem. For each degree configuration $c \in C_L$, the value of $\mathrm{IN}^{DBP(\mathcal{R}(c), L)}$ is $\leq$ the
AGM bound on $\mathcal{R}(c)$.


Proof. For any relation $R \in \mathcal{R}(c)$, and any $A \subseteq \mathrm{attr}(R)$, $d_{R,A}$ denotes the maximum degree
of any value in $A$ in relation $R$. $d_{R,\emptyset}$ simply equals $|R|$. Note that the degree configuration
$c$ specifies a degree bucket for each $(R, A)$. Let $d'_{R,A}$ denote the minimum degree of that
bucket. The actual maximum degree $d_{R,A}$ may be strictly less than the values in the bucket
(which range up to $L d'_{R,A}$) because some of the neighbors of values in $A$ in the original relation
may not be compatible with degree configuration $c$. The actual degree $d_{R,A}$ is also $\leq |\pi_{\mathrm{attr}(R) \setminus A}(R)|$.
Now we define an effective size $S(R, A)$ for any pair $(R, A)$ inductively:

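• $S(R, \emptyset) = 1$

• $S(R, A) = \max_{A' \subset A} S(R, A') \times \frac{d_{\pi_A(R), A'}}{L}$

If $A \neq \emptyset$, then setting $A' = \emptyset$ in the definition tells us that
$S(R, A) \geq S(R, \emptyset) \times \frac{d_{\pi_A(R), \emptyset}}{L} = \frac{|\pi_A(R)|}{L}$.
This tells us that $S(R, A)$ is lower bounded by the actual size of $\pi_A(R)$ divided by $L$.
We can inductively prove an upper bound on $S(R, A)$, by its maximum possible size divided
by $L$. Specifically, for $A \neq \emptyset$:

$$S(R, A) \leq \frac{|R|}{d'_{R,A} L}$$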


This is easily true for singleton $A$s, since their $S$ is simply equal to $\frac{|\pi_A(R)|}{L} \leq \frac{|R|}{d'_{R,A} L}$. For
bigger $A$s, we can prove this as follows: Each $A'$ value in the current configuration has at
most $L d'_{R,A'}$ neighbors in the original $R$. Each $A$ value in the current configuration has at
least $d'_{R,A}$ neighbours in the original $R$. Thus, each $A'$ value in the current configuration has
at most $\frac{L d'_{R,A'}}{d'_{R,A}}$ neighbors in $\pi_A(R)$ in the current configuration, i.e. $d_{\pi_A(R), A'} \leq \frac{L d'_{R,A'}}{d'_{R,A}}$. Now
in the definition of $S(R, A)$, if $A' = \emptyset$, then we again get

$$S(R, \emptyset) \times \frac{d_{\pi_A(R), \emptyset}}{L} \leq 1 \times \frac{|\pi_A(R)|}{L} \leq \frac{|R|}{d'_{R,A} L}$$

For $A' \neq \emptyset$, we have

$$S(R, A') \times \frac{d_{\pi_A(R), A'}}{L} \leq \frac{|R|}{d'_{R,A'} L} \times \frac{L d'_{R,A'}}{d'_{R,A} L} = \frac{|R|}{d'_{R,A} L}$$
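
To make the inductive definition of the effective size concrete, the following sketch computes $S(R, A)$ on a tiny relation. It assumes the maximum ranges over proper subsets $A' \subset A$ (otherwise the recursion would be circular) and reads $d_{\pi_A(R), A'}$ as the largest number of distinct $A$-tuples sharing one $A'$-value; the function names and the toy relation are hypothetical.

```python
from itertools import combinations

def max_degree(rel, schema, A, A_sub):
    """d_{pi_A(R), A'}: max number of distinct A-tuples sharing one A'-value."""
    idx = [schema.index(a) for a in A]
    proj = {tuple(t[i] for i in idx) for t in rel}      # pi_A(R)
    sub = [A.index(a) for a in A_sub]
    counts = {}
    for t in proj:
        key = tuple(t[i] for i in sub)
        counts[key] = counts.get(key, 0) + 1
    return max(counts.values())

def effective_size(rel, schema, A, L):
    """S(R, A) following the inductive definition, with S(R, ()) = 1."""
    if not A:
        return 1.0
    best = 0.0
    for k in range(len(A)):                              # proper subsets A' only
        for A_sub in combinations(A, k):
            s = effective_size(rel, schema, A_sub, L)
            best = max(best, s * max_degree(rel, schema, A, A_sub) / L)
    return best

R = [(1, "a"), (1, "b"), (2, "a")]
S_xy = effective_size(R, ("X", "Y"), ("X", "Y"), L=2)
S_x = effective_size(R, ("X", "Y"), ("X",), L=2)
```

On this toy input the maximum for $\{X, Y\}$ is attained at $A' = \emptyset$, so $S$ coincides with the lower bound $|\pi_A(R)|/L$.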


We prove the result by giving a sequence of linear programs, starting from the dual of the
fractional cover program (whose optimal objective value equals the log of the AGM bound),
and ending with the DBP program (whose optimal objective value equals log of the DBP
bound), such that the optimal objective value in each step is less than or equal to that in
the previous step.

1. To start with, we have the dual of the fractional cover linear program, that assigns a
non-negative value $v_a$ to each attribute $a$ such that for each relation $R$ in the join, the
sum of values of attributes assigned to that relation is less than log of the relation size
$|R|$. The objective is to maximize the sum of the $v_a$s. The optimal objective value for
this program gives us the AGM bound.

2. We modify the program to include constraints for subrelations. That is, for each $R$, for
each $A \subseteq \mathrm{attr}(R)$, we add a constraint saying that the sum of values of attributes in
$A$ must be $\leq \log\left(\frac{|R|}{d'_{R,A}}\right)$. The program is still feasible (since all $v_a$s equal to zero is a
valid solution), but more constrained than the previous one. Since it is a maximization
problem, additional constraints can only reduce the optimal objective value.

3. We reduce the right hand sides of the constraints from $\log\left(\frac{|R|}{d'_{R,A}}\right)$ to $\log(S(R, A))$. Since
$S(R, A) \leq \frac{|R|}{d'_{R,A}}$ for each $R$, $A$, the resulting program is strictly more constrained, while still being
feasible, and hence its optimal objective value is less than or equal to the previous
program.

4. Now we actually consider an optimal solution to the linear program. Some of the
constraints must be tight in the optimal solution. Moreover for each attribute $a$, there
must exist a tight constraint $(R, A)$ such that $a \in A$, because otherwise we could
increase $v_a$ slightly, increasing the objective value, without violating any constraints, which
contradicts the optimality of our solution. That is, the set of tight constraints $(R, A)$
form a cover of the attributes. Call the cover $C$. Replace the inequality constraints for
$(R, A) \in C$ with equality constraints. The resulting program is more constrained, but the
previous optimal solution is feasible for this program as well, so it has the exact same
optimal objective value.

5. Now for each $(R, A) \in C$ and each $A' \subseteq A$, we have an equality constraint $\sum_{a \in A} v_a = \log(S(R, A))$
and an inequality constraint $\sum_{a \in A'} v_a \leq \log(S(R, A'))$. Together, these
constraints imply $\sum_{a \in A \setminus A'} v_a \geq \log\left(\frac{S(R, A)}{S(R, A')}\right)$. Thus, for each $(R, A) \in C$, $A' \subseteq A$,
we keep the equality constraint $\sum_{a \in A} v_a = \log(S(R, A))$, but replace $\sum_{a \in A'} v_a \leq \log(S(R, A'))$
with $\sum_{a \in A \setminus A'} v_a \geq \log\left(\frac{S(R, A)}{S(R, A')}\right)$. This gives an equivalent linear program,
which hence has the same optimal objective as before. Note that by replacing $A'$
with $A \setminus A'$, we can rewrite the above constraint as $\sum_{a \in A'} v_a \geq \log\left(\frac{S(R, A)}{S(R, A \setminus A')}\right)$.

6. Now, we keep constraints the same, but try to minimize rather than maximize the
objective. The resulting program is still feasible, but may have a smaller objective value.
The value won’t be zero because now we have $\geq \log\left(\frac{S(R, A)}{S(R, A \setminus A')}\right)$ constraints for the
$R$, $A$, $A'$s.

7. Earlier, we had only changed constraints for $R$, $A$, $A'$ where $(R, A)$ belonged to cover
$C$ and $A'$ was a subset of $A$ (turning them from $\leq$ constraints to $\geq$ constraints). Thus,
from our original dual program, we may have leftover $\leq$ constraints for $A'$ that are not
the subset of any $A$ in the cover. We drop these constraints. The resulting problem is
now less constrained than earlier, and since it is a minimization problem, the resulting
objective can only be smaller.

8. For $A' \subseteq A$, the inductive definition of $S$ tells us that $\frac{S(R, A)}{S(R, A \setminus A')} \geq \frac{d_{\pi_A(R),\, A \setminus A'}}{L}$. We
change the RHS of the $R$, $A$, $A'$ constraints from $\log\left(\frac{S(R, A)}{S(R, A \setminus A')}\right)$ to $\log\left(\frac{d_{\pi_A(R),\, A \setminus A'}}{L}\right)$.
This only loosens the constraints. For each $(R, A) \in C$, we currently have an equality
constraint $\sum_{a \in A} v_a = \log(S(R, A))$. We use the known lower bound on $S(R, A)$ to replace
the equality constraint by $\sum_{a \in A} v_a \geq \log\left(\frac{|\pi_A(R)|}{L}\right)$. This also loosens the constraints.
Note that since $d_{\pi_A(R), \emptyset} = |\pi_A(R)|$, this constraint is actually now a special case of the
constraints with $R$, $A$, $A'$. Since both the above steps loosen the constraints, this can
only decrease the optimal objective value.

9. The resulting linear program can be seen to be the program used to define DBP, with
an extra $\frac{1}{L}$ factor in the RHS of each constraint. As $L$ becomes smaller, the optimal
objective value of the program tends to that of the DBP program. Moreover, since DBP
itself is a minimum over all covers, while for this program we chose a specific cover, the
actual DBP is less than the solution to this linear program, which is less than the AGM
bound.

This proves the result, as required. ◄ 

If $L$ is less than the size of each relation, and $\rho^*$ is the fractional cover of the join query
(used in the AGM bound), then in fact $DBP(\mathcal{R}(c), L) \leq L^{-\rho^*} \mathrm{AGM}$. This can be seen by
replacing the right hand sides of the constraints of the program in step 1 by $\frac{|R|}{L}$ instead of
$|R|$. This reduces the objective value of the original program, and the remaining steps still
go through.

F.3 Examples comparing the DBP and AGM bounds 

► Example 42. (Comparison between DBP and AGM)

Let $L = 2$ for this example. Consider a triangle join $R(X, Y) \bowtie S(Y, Z) \bowtie T(Z, X)$. Let
$|R| = |S| = |T| = N$. Let the degree of each value $x$ in $X$, in $R$ and $T$, be $d$. For different
values of $d$, we will choose a cover $C$ and find the objective value of the linear program for
that cover. Note that the DBP bound is a minimum over all covers, so it is possible that a
different cover $C^*$ gives an even smaller linear program objective, but the purpose of this
example is to show that the DBP bound can be much tighter than the AGM bound; hence it
suffices to show that an ‘upper bound’ on the DBP bound is much tighter than the AGM
bound.

Case 1. $d < \sqrt{N}$: We choose cover $C = \{(R, \{X, Y\}), (T, \{X, Z\})\}$. For this cover, the
solution to Linear Program 4 is $w_{R,\{X,Y\}} = w_{T,\{Z\}} = 1$ with all other values set to 0. The
objective value is $\log(N) + \log(d) = \log(Nd)$. Thus, the DBP bound is $\leq Nd$, which tells us
that join output size is upper bounded by $Nd$.

Case 2. $d > \sqrt{N}$: Since $d$ is large, the number of distinct $X$ values must be small.
To take advantage of this, we consider cover $C' = \{(R, \{X\}), (S, \{Y, Z\})\}$. Now the linear
program solution is trivially $w_{R,\{X\}} = w_{S,\{Y,Z\}} = 1$, which gives us the join size bound of
$\frac{N^2}{d}$ (since $d_{\pi_X(R), \emptyset} = |\pi_X(R)| \leq \frac{N}{d}$).

In contrast, the AGM bound gives us a loose upper bound of $N^{3/2}$ irrespective of degree $d$.
Computing the AGM bound individually over each degree configuration does not help us do
better, as the above example can have all tuples in a single degree configuration.
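
The gap between the two bounds in this example is easy to tabulate. The sketch below takes the smaller of the two cover-based bounds from Cases 1 and 2 ($Nd$ and $N^2/d$) as an upper bound on DBP and compares it with the AGM bound $N^{3/2}$; the function names and the chosen values of $N$ and $d$ are illustrative assumptions.

```python
import math

def dbp_upper_bound(N, d):
    """Smaller of the two cover-based bounds from Example 42: N*d and N^2/d."""
    return min(N * d, N * N / d)

def agm_bound(N):
    """AGM bound N^(3/2) for the triangle join with three relations of size N."""
    return N * math.sqrt(N)

N = 10**6
ratios = {d: agm_bound(N) / dbp_upper_bound(N, d) for d in (10, 1000, 10**5)}
```

For $N = 10^6$ the two bounds coincide only at $d = \sqrt{N} = 1000$; at $d = 10$ and $d = 10^5$ the AGM bound is larger by a factor of 100.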

► Example 43. As suggested by the above example, the DBP bound has a tighter exponent
than the AGM bound for almost all possible degrees (namely, degrees higher or lower than
$\sqrt{N}$). As a more general example, suppose we have a join consisting of binary relations of
size $N$ each, where each value has degree $d$, where the join hypergraph is connected. Then
the AGM bound on this join will equal the DBP bound only when $d \approx \sqrt{N}$. If $d < \sqrt{N}$,
then the DBP bound will be smaller than the AGM bound by a factor of at least Ni. 
To show this, consider a traversal of the join hypergraph $X_1, X_2, \ldots, X_n$ such that
$R(X_1, X_2)$ is a relation in the join, and for all $i > 2$, there is a $j < i$ such that $X_j, X_i$ is a
relation (call it $R(i)$) in the join. Then consider cover
$C = \{(R, \{X_1, X_2\})\} \cup \{(R(i), \{X_i\}) \mid i > 2\}$. The solution to the linear program is $w_{R',A'} = 1$ for all $(R', A') \in C$ and 0 otherwise.
This gives us a bound of $N \times \prod_{i > 2} d = N d^{n-2}$. In contrast, if we have $n$ attributes, the AGM
bound must be at least $\sqrt{N}^{\,n}$ (which is actually achieved if all attributes have $\sqrt{N}$ values
and all relations are full cartesian products). Thus the ratio of the AGM bound to the DBP
bound is at least (^yp) ra_2 > y/~N = Ni. 

On the other hand, $d$ cannot be $> \sqrt{N}$ for all values, because if it is (say in relation
$R(X, Y)$), then the number of values in attribute $X$ must be $O(\sqrt{N})$, which is thus smaller
than the degree of values in $Y$.

F.4 Proof of Lemma 28

► Lemma. The shares algorithm, where each attribute $a$ has share $\mathrm{IN}^{v_a}$, where $v_a$ is from
the solution to Linear Program 3, has a load of $O(L)$ per processor with high probability,
and a communication cost of $O(\max_{c \in C_L} L \cdot \mathrm{IN}^{DBP(\mathcal{R}(c), L)})$.

Communication: Consider any $(R, A) \in C$. As per the shares algorithm, every tuple in
$\pi_A(R)$ will have to be sent to every processor whose hash value in $A$ matches that of the
tuple. Thus, the number of processors to which each tuple is sent is given by $\prod_{a \notin A} \mathrm{IN}^{v_a}$.
Thus, total communication for $R$, $A$ is given by

$$|\pi_A(R)| \times \prod_{a \notin A} \mathrm{IN}^{v_a} \leq L \cdot \mathrm{IN}^{\sum_{a \in A} v_a} \times \mathrm{IN}^{\sum_{a \notin A} v_a} = L \cdot \mathrm{IN}^{DBP(\mathcal{R}(c), L)}$$

Thus, total communication is bounded by $L \cdot \mathrm{IN}^{DBP(\mathcal{R}(c), L)}$ (multiplied by some factors
that depend on the number of relations and schema sizes, but not on the number of tuples
in the relations).

Load: Now we analyze load per processor. We will show that the $m$th moment of load on
a processor is $O(L^m)$, which shows that the load is $O(L)$ with high probability, ignoring
factors not depending on IN. Consider an $(R, A) \in C$, and a processor with hash value $h_1$
for $A$ and $h_2$ for remaining attributes. Each tuple of $\pi_A(R)$ will be sent to this processor if
its hash on $A$ equals $h_1$. For any value $x \in \pi_A(R)$, let $I_x$ be an indicator variable that is true
if the hash of $x$ equals $h_1$. Then expected load on the processor from $(R, A)$ is

$$E[\text{Load}] = \sum_{x \in \pi_A(R)} E[I_x] \leq L \cdot \mathrm{IN}^{\sum_{a \in A} v_a} \times \mathrm{IN}^{-\sum_{a \in A} v_a} = L$$

Now let us consider the $m$th moment of the load. Consider $m$ tuples $t_1, t_2, \ldots, t_m \in \pi_A(R)$.
Each tuple specifies a value in each attribute in $A$. Some of these values may be equal
to each other. For example, for tuples $(x, y)$ and $(x, y')$, the first value is equal. We
are going to count the number of $m$-sized sets of tuples with the same pattern of equal
values, and the probability of all these tuples being sent to the processor, and show that
it is $O(L^m)$. Define $T_l$ for $1 \leq l \leq m$ to be the set of attributes whose values in $t_l$
occur in $t_l$ but not in $t_1, t_2, \ldots, t_{l-1}$. For instance, if $R$ had schema $(X, Y, Z)$ and we had
tuples $t_1 = (x_1, y_1, z_1)$, $t_2 = (x_1, y_2, z_2)$, $t_3 = (x_2, y_2, z_1)$, $t_4 = (x_2, y_3, z_3)$, then we would
have $T_1 = \{X, Y, Z\}$, $T_2 = \{Y, Z\}$, $T_3 = \{X\}$, $T_4 = \{Y, Z\}$. $T_1$ is always equal to $A$ by
this definition. The probability of all these tuples being hashed to a given processor is
$\prod_{l=1}^{m} \prod_{a \in T_l} \mathrm{IN}^{-v_a}$. The number of such tuple sets is upper bounded by $\prod_{l=1}^{m} d_{\pi_A(R),\, A \setminus T_l}$,
since the number of ways of choosing $t_l$ such that its $A \setminus T_l$ part is fixed, is $d_{\pi_A(R),\, A \setminus T_l}$. Thus,
the $m$th moment of the load is:

$$\prod_{l=1}^{m} d_{\pi_A(R),\, A \setminus T_l} \times \prod_{l=1}^{m} \prod_{a \in T_l} \mathrm{IN}^{-v_a} \leq L^m \prod_{l=1}^{m} \mathrm{IN}^{\sum_{a \in T_l} v_a} \times \prod_{l=1}^{m} \prod_{a \in T_l} \mathrm{IN}^{-v_a} = L^m$$

Thus, the $m$th moment of load is $O(L^m)$, and so the load per processor is $O(L)$ with high
probability, ignoring terms not depending on IN.
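
The sets $T_l$ used in the moment computation can be derived mechanically from a sequence of tuples. The following sketch (the function name is illustrative) reproduces the worked instance from the proof, treating a value as new for $t_l$ if it has not appeared in the same attribute of any earlier tuple.

```python
def new_value_attrs(tuples, schema):
    """For each tuple t_l, return T_l: the attributes whose value in t_l
    does not appear (in that attribute) in any earlier tuple."""
    seen = {a: set() for a in schema}
    result = []
    for t in tuples:
        T_l = {a for a, v in zip(schema, t) if v not in seen[a]}
        for a, v in zip(schema, t):
            seen[a].add(v)
        result.append(T_l)
    return result

schema = ("X", "Y", "Z")
tuples = [("x1", "y1", "z1"), ("x1", "y2", "z2"),
          ("x2", "y2", "z1"), ("x2", "y3", "z3")]
T = new_value_attrs(tuples, schema)
```

The first set always contains every attribute, in line with the observation that $T_1$ equals $A$.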


F.5 Additional Examples for the parallel algorithm 

► Example 44. Generalizing the previous example, let the degree of each value be $O(\delta)$,
where $\delta < \sqrt{N}$. Let $p$ be the required number of processors at load level $L$.

• If L < 5, then p = DBP(7?., L) = jj. 

• If 5 < L < f, then p = DBP(77, L) = f. 

• If f < L < N, then p = DBP(ft, L) = 1. 

Now we invert the above analysis to see how changing the number of processors $p$ changes
load $L$. When $p = 1$ we have $L = N\delta^{-1}$. As $p$ increases up to $N\delta^{-1}$, the load is $Np^{-1}$. So
as long as $p < N\delta^{-1}$, we get optimal parallelism. Beyond that, as $p$ increases to $N\delta$, load
decreases as $\sqrt{N\delta p^{-1}}$. Thus, beyond $N\delta^{-1}$, doubling $p$ gives us only a $\sqrt{2}$ reduction in load.
Finally, when $p = N\delta$, the load becomes $O(1)$, which is the maximum parallelism level.
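
The two regimes described above can be written as a single piecewise function of $p$. This is only a sketch of the stated asymptotics with all constant factors dropped; the function name and the sample values of $N$ and $\delta$ are assumptions.

```python
import math

def load(p, N, delta):
    """Load per processor for p processors, constants ignored (Example 44)."""
    if p <= N / delta:
        return N / p                     # optimal-parallelism regime
    return math.sqrt(N * delta / p)      # diminishing-returns regime

N, delta = 10**6, 100
p0 = N // delta                          # boundary between the two regimes
```

At the boundary `p0` both expressions give a load of $\delta$, and past it quadrupling the processors only halves the load, matching the $\sqrt{2}$ reduction per doubling.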

► Example 45. In this example, we demonstrate that our parallel algorithm can even
outperform existing optimal sequential algorithms:

Consider a triangle join $\mathcal{R} = \{R_1(X, Y), R_2(Y, Z), R_3(Z, X)\}$. Let $|R_1| = |R_2| = |R_3| = N$.
Also suppose $|Z| = N$, $|X| = |Y| = \sqrt{N}$, and the degrees of all $z \in Z$ are $O(1)$ while
degrees of values in $X$, $Y$ are $O(\sqrt{N})$. The DBP bound on the join is $O(N)$. Running
worst case optimal algorithms like NPRR and LFTJ takes time $N^{3/2}$ to process the join if the
attribute order is $X, Y, Z$ or $Y, X, Z$. On the other hand, a simple sequentialized version of
our parallel algorithm takes time $O(N)$. By concatenating three such joins, with a different
attribute being the sparse attribute each time, we get a join for which NPRR takes time
$N^{3/2}$ for all attribute orders, while our sequentialized parallel algorithm takes time $O(N)$.

Note that using GHD based algorithms (that have runtime $O(\mathrm{IN}^{fhw} + \mathrm{OUT})$) does not
improve the $O(N^{3/2})$ runtime, as all three relations must be in a single bag of the GHD.



